In computational science, the scale of problems addressed and the resolution of solutions achieved are often limited by the available computational capacity. The current methodology of scaling computational capacity to large scale (i.e. larger than individual resource site capacity) includes aggregation and federation of distributed resource systems. Regardless of how this aggregation manifests, scaling of scientific computational problems typically involves (re)formulation of computational structures and problems to exploit problem and resource parallelism. Efficient parallelization and scaling of scientific computations to large scale is difficult and further complicated by a number of factors introduced by resource aggregation, e.g., resource heterogeneity and coupling of computational methodology. Scaling complexity severely impacts computation enactment and necessitates the use of mechanisms that provide higher abstractions for management of computations in distributed computing environments. This work addresses design and construction of virtual infrastructures for scientific computation that abstract computation enactment complexity, decouple computation specification from computation enactment, and facilitate large-scale use of computational resource systems. In particular, this thesis discusses job and resource management in distributed virtual scientific infrastructures intended for Grid and Cloud computing environments. The main area studied is Grid computing, which is approached using Service-Oriented Computing and Architecture methodology. Thesis contributions discuss both methodology and mechanisms for construction of virtual infrastructures, and address individual problems such as job management, application integration, scheduling job prioritization, and service-based software development. I addition to scientific publications, this work also makes contributions in the form of software artifacts that demonstrate the concepts discussed. The Grid Job Management Framework (GJMF) abstracts job enactment complexity and provides a range of middleware-agnostic job submission, control, and monitoring interfaces. The FSGrid framework provides a generic model for specification and delegation of resource allocations in virtual organizations, and enacts allocations based on distributed fairshare job prioritization. Mechanisms such as these decouple job and resource management from computational infrastructure systems and facilitate the construction of scalable virtual infrastructures for computational science.
Page Responsible: Frank Drewes 2024-11-21