CSF Meta-scheduler In Bioinformatics Grid: Application And Research

Posted on: 2009-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y Luo
GTID: 2178360242980636
Subject: Computer system architecture
Abstract/Summary:
The life science community is experiencing a period of unprecedented change, challenge and opportunity. With the sequencing of the human genome complete, scientific research based on computer simulation opens up many possibilities: targeting and analysis of a vast array of biological data sets, and identification of the genetic factors behind disease, leading to complete biological understanding and tailored genetic treatments that support e-Health solutions. In this field, given the growing demand for high-performance computing and distributed data storage, a single high-performance computer or a single cluster with a large data storage system is no longer an adequate solution. Meanwhile, over the past several years, Grid computing has emerged as a way to harness and take advantage of computing resources across geographies and organizations. The Grid computing paradigm has the potential to become the model for a standard cyber-infrastructure for life science research. The Grid represents one way in which such an infrastructure can be developed and supported, providing bioinformatics scientists with seamless access to computational and data resources and management of large-scale bio-data sets, coupled with solutions to the problems of cross-organizational resource sharing.

In order to deliver cyber-infrastructure to the general scientific and bioinformatics research community, transparent access and ease of use are of critical importance. The increased level of sophistication requires that cyber-infrastructure developers either work closely with the application scientists or develop middleware that flattens the learning curve, so that these scientists can use the grid willingly and transparently. Many life sciences researchers prefer to run applications in the grid environment without modification and without knowledge of the specific computational resources being utilized.
They are held back by the learning curve and the cost of porting to the grid, even while they are "crying out for" the use of grid infrastructure.

CSF4 (Community Scheduler Framework 4.0), developed by our lab at Jilin University, is the first WSRF-compliant meta-scheduler and is released as an execution management component of Globus Toolkit 4. It provides a set of services, such as the job service, queue service, reservation service and resource management service, through which CSF4 can perform grid-level scheduling. Using CSF4, grid users can access different local resource managers, such as LSF, PBS, Condor and SGE, through a single interface. CSF4 is designed to support a scheduler plug-in model that facilitates the implementation of customized scheduling policies. CSF4 also provides a flexible user proxy delegation mechanism, so that a job can run with a full proxy to access grid services with strict security requirements, such as Gfarm.

Quite frequently, life sciences applications are "pleasantly parallel", i.e., serial applications that are run many times in parallel over different input data. For example, AutoDock may be used to dock different ligands to a target protein structure, or BLAST may be used with different input sequences to search for potentially related sequences within a target database. An AutoDock or BLAST run normally consists of a large number of sub-jobs, which execute the same binary with different input/output files. The meta-scheduling objective is to balance the load between clusters and complete the entire set of jobs as soon as possible. Here we show how a customized scheduling policy for these applications can be developed using the CSF4 plug-in model. A user submits the job for such an application only once, and the meta-scheduler generates all the sub-jobs automatically. A testbed of three clusters is set up with BLAST deployed; each cluster has a different number of hosts and a dynamic local workload.
With the array job plug-in, a better balance of workload is achieved.

Moreover, grid services are currently accessed mainly through command-line input. We broaden this access by using portal technologies, which can present a graphical interface to end users. A grid portal enables the transition to an applications environment, where the actions of the underlying middleware are completely transparent to the end user. This transparency can now be reflected across the user interface components, the portlets. Quite simply, there are already enough interfaces to middleware tools that merely replace command-line arguments. In this thesis, we build a CSF Portlet for bioinformatics scientists, based on the GridSphere Portal Framework. CSF Portlet, first developed in 2006, is a Java-based web application for dispatching jobs to remote job schedulers through a web browser, without requiring the user to understand the underlying Grid services. It is not only a front end to CSF4 itself, but also a generic interface for users to submit jobs, view job specifications and job history, monitor job status, and retrieve job output from remote sites using GridFTP. Additionally, CSF Portlet uses GridPortlets to provide methods for keeping track of the GSS credentials for a given user, obtained either from a GAMA server or directly from a MyProxy server. It can automatically generate an RSL script for job submission, and it also provides an interface for experienced users to write RSL scripts themselves. CSF Portlet also supports the integration of CSF4 with the Gfarm file system and allows users to decide which delegation mode (full or limited) should be used.
Furthermore, CSF Portlet is able to find an appropriate way to dispatch specific kinds of jobs, in particular data-intensive jobs, compute-intensive jobs and MPI jobs.

Here we report, through the collaboration between Jilin University (JLU) and the University of California at San Diego (UCSD), the latest advances in building a transparent grid environment for the biomedical community, termed My WorkSphere. It uses Gfarm as a computational data grid, CSF4 as the meta-scheduler together with its array-job scheduling plug-in for biomedical applications such as BLAST, and GAMA (Grid Account Management Architecture) as the grid credential manager, all within a GridSphere portal-based environment. At present, My WorkSphere has been deployed at the San Diego Supercomputer Center as a computational platform of NBCR and as one of the TeraGrid Science Gateways.

We are currently working towards several goals. On the one hand, we plan to enhance the CSF4 plug-in mechanisms to better support the implementation of customized scheduling policies. On the other hand, the development of scheduling policies is still lacking. For instance, we have designed a resource co-allocation model for cross-domain parallel jobs, but have yet to optimize it. Assigning a parallel job across clusters remains challenging work, because it requires collecting and analyzing workload information from the local schedulers with high efficiency. Moreover, CSF Portlet version 1.0.0 is not fully JSR 168 compliant, as it uses the GridSphere API. We plan to make CSF Portlet version 1.0.1 a 100% JSR 168 compliant portlet in the coming months. As My WorkSphere has already been deployed, the main upcoming concerns are system maintenance and software upgrades, along with the associated compatibility issues.
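
The array-job idea described in the abstract (one submission expanded into many sub-jobs that run the same binary on different input files, then spread across clusters according to their size and current load) can be illustrated with a minimal Python sketch. This is not the CSF4 plug-in itself: the `Cluster` class, the `expand_array_job` and `dispatch` functions, the load metric (queued plus assigned jobs per host), the `.out` output naming and the greedy least-loaded policy are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical view of one cluster as seen by a meta-scheduler."""
    name: str
    hosts: int                 # number of hosts in the cluster
    queued_jobs: int = 0       # local workload already waiting (assumed metric)
    assigned: list = field(default_factory=list)

    def load(self) -> float:
        # Assumed load metric: jobs per host, counting local and assigned work.
        return (self.queued_jobs + len(self.assigned)) / self.hosts


def expand_array_job(binary, inputs):
    """Turn a single submission into one sub-job per input file."""
    return [
        {"binary": binary, "input": path, "output": path + ".out"}
        for path in inputs
    ]


def dispatch(sub_jobs, clusters):
    """Greedily give each sub-job to the currently least-loaded cluster."""
    # Heap entries are (load, cluster index); the index breaks ties.
    heap = [(c.load(), i) for i, c in enumerate(clusters)]
    heapq.heapify(heap)
    for job in sub_jobs:
        _, i = heapq.heappop(heap)
        clusters[i].assigned.append(job)
        heapq.heappush(heap, (clusters[i].load(), i))
    return {c.name: c.assigned for c in clusters}


if __name__ == "__main__":
    # Three clusters with different host counts and local workloads,
    # loosely mirroring the three-cluster testbed; the numbers are made up.
    testbed = [
        Cluster("A", hosts=4),
        Cluster("B", hosts=2),
        Cluster("C", hosts=2, queued_jobs=4),
    ]
    jobs = expand_array_job("blastall", [f"seq_{i}.fasta" for i in range(8)])
    for name, assigned in dispatch(jobs, testbed).items():
        print(name, len(assigned))
```

A real policy would also have to refresh the load figures as local schedulers report changes, since the abstract notes that local workload is dynamic; the sketch takes a single snapshot for simplicity.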
Keywords/Search Tags: Meta-scheduler