
Job Scheduling Technologies In Data Intensive Supercomputing Systems

Posted on: 2012-08-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Chen
Full Text: PDF
GTID: 2218330362460363
Subject: Computer Science and Technology
Abstract/Summary:
Drawing on experience from industrial production design and an analysis of the shortcomings of conventional supercomputing, the academic community has proposed Data-Intensive Supercomputing, a new parallel approach to processing large-scale data. Data-Intensive Supercomputing has two defining features: first, computation time is proportional to the scale of the data; second, computation is sent to the data rather than data to the computation, a principle known as data locality. A data cluster built on the Data-Intensive Supercomputing model can offer services as the "cloud" in cloud computing.

One prototype of Data-Intensive Supercomputing proposed by academia is Google MapReduce, followed by Hadoop, an open-source implementation based on the MapReduce model. Since then, a great deal of research has been devoted to job scheduling in Hadoop clusters, mainly aimed at the straggler problem, in which some nodes take significantly longer to finish than others. The causes of stragglers are complex: they may stem from faulty machines or networks, or from how the dataset is partitioned.

We argue that imbalanced data partitioning over a key space with low entropy is a non-trivial cause of stragglers, and that no ideal solution exists so far. In this thesis, we propose a runtime load-balancing mechanism that rebalances computation while a job is running, lowering the probability that stragglers appear. Building on this mechanism, and following the principle of data locality, we develop a data locality enhancement mechanism to reduce a job's overall running time. We implement a prototype on iterative Hadoop, also known as HaLoop, and evaluate each mechanism. Experiments show that the runtime load-balancing mechanism balances computation effectively, and that the data locality enhancement mechanism significantly reduces job running time under favorable conditions.
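The abstract does not give implementation details of the runtime load-balancing mechanism. Purely as an illustration of the key-skew problem it identifies, and not as the thesis's own method, the sketch below shows a hypothetical custom Hadoop Partitioner that scatters a pre-identified hot key across reduce partitions instead of letting hash partitioning send all of its records to a single reducer; the class name and hot key are invented for the example.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.Random;

// Illustrative sketch: when the key space has low entropy, one "hot" key can
// dominate a single reduce partition and create a straggler. This partitioner
// scatters that key's records across all reducers.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    // Hypothetical hot key assumed to be known before the job runs
    // (an assumption for the example, not part of the thesis's mechanism).
    private static final String HOT_KEY = "popular-key";
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().equals(HOT_KEY)) {
            // Spread the hot key over all reducers; a combiner or a second
            // aggregation pass would then merge the partial results.
            return random.nextInt(numPartitions);
        }
        // Default behaviour: hash-based partitioning, as in Hadoop's HashPartitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Such a static workaround only helps when the hot key is known in advance, which motivates the runtime, in-flight rebalancing approach the abstract describes.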
Keywords/Search Tags: Data-Intensive, Supercomputing, Cloud Computing, MapReduce, Hadoop