The Application Case Study Of MapReduce Parallel Computation And The Optimization Of Its Runtime Framework

Posted on:2013-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:X L Yang

Full Text:PDF

GTID:2248330371987902

Subject:Computer software and theory

Abstract/Summary:

In business, science, and daily life, the volume of data being produced is growing at an astonishing rate. Traditional data storage and processing techniques and tools, represented by relational database systems, cannot capture, manage, and process data of such volume that accumulates so quickly. Big data contains more useful information but also raises more challenges, and big data processing techniques have become a focus of current research. Against this background, industry and academia have reached a consensus that parallel computing is the only way to deal with big data. However, parallel computing techniques are always closely bound to specific applications, and the diversity of those applications has prevented the emergence of a common or unified parallel computational model or framework.

The MapReduce technique, originally published by Google, has become the most successful technique for big data processing because of its high scalability and ease of use, and it has been widely applied. Hadoop, the mainstream open-source implementation of MapReduce, is the de facto industry standard for big data processing. However, the current implementation is geared toward large-scale batch processing and ignores the fast-response requirements of many real applications, such as online data processing and queries. To address this problem, we examined the MapReduce execution framework in depth and made some optimizations. Our main contributions are the following two points:

(1) An application case study of the MapReduce parallel computational model. We took BLAST, the most famous sequence alignment tool in bioinformatics, as the subject. We analyzed its data partitioning and computation partitioning problems and give two methods, based on the MapReduce model, for parallelizing it. We carried out extensive experiments to evaluate and compare the two methods.

(2) The optimization of the MapReduce execution framework. Through an analysis of the time overhead of a MapReduce job and a dissection of the MapReduce execution framework, we propose two optimizations. The first moves the work of job setup and job cleanup from the TaskTracker to the JobTracker, reducing the time overhead of job preparation and cleanup. The second changes the task assignment scheme from pull to push and separates task state-change event messages from the fixed-period heartbeat so that they are transmitted immediately. Finally, we use the BLAST application in our final experiments to evaluate the performance of our optimizations, and the experimental results show that our methods are very effective.
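The effect of switching task assignment from pull to push can be illustrated with a small timing model. The following is only a sketch under stated assumptions, not the thesis implementation: the function names, the 3-second heartbeat interval, and the sample ready times are all hypothetical, and network latency is ignored. In the pull model a task that becomes ready must wait for the worker's next periodic heartbeat; in the push model the master sends it as soon as it is ready.

```python
import math


def pull_dispatch_time(ready_at: float, heartbeat_interval: float) -> float:
    """Pull model (hypothetical): a task is handed out only at the next heartbeat.

    Heartbeats are assumed to fire at 0, interval, 2 * interval, ...
    """
    return math.ceil(ready_at / heartbeat_interval) * heartbeat_interval


def push_dispatch_time(ready_at: float) -> float:
    """Push model (hypothetical): the master sends the task the moment it is ready."""
    return ready_at


def mean_wait(ready_times, dispatch) -> float:
    """Average delay between a task becoming ready and being dispatched."""
    return sum(dispatch(t) - t for t in ready_times) / len(ready_times)


# Assumed values, chosen only for illustration.
HEARTBEAT_INTERVAL = 3.0
ready_times = [0.5, 1.0, 2.5, 3.0, 4.0]

pull_wait = mean_wait(
    ready_times, lambda t: pull_dispatch_time(t, HEARTBEAT_INTERVAL)
)
push_wait = mean_wait(ready_times, push_dispatch_time)

print(f"mean wait with pull: {pull_wait:.2f} s")
print(f"mean wait with push: {push_wait:.2f} s")
```

Under this simplified model the pull scheme adds a dispatch delay of up to one heartbeat interval per task, which matters little for long batch jobs but can be a large share of the runtime for short, interactive ones.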

Keywords/Search Tags:

parallel computing, big data processing, sequence alignment, MapReduce optimization

Related items

1	Parallel Design And Optimization Of Sequence Alignment Algorithm Based On APU
2	Study On Parallel Processing For Sequence Alignment On Heterogeneous Cluster Computing Systems
3	Optimization Of Sequence Alignment Parallel Software On CUDA
4	Design And Optimization Of Parallel Algorithm For Biological Sequences
5	Research On Optimization Technology Of Data Parallel Processing Based On MapReduce
6	Research On Parallel Processing Technology Of Sequence Analysis
7	GPU data-parallel computing of sequence alignment using CUDA
8	Parallel Bio-computing Research And Implementation Of The Network Environment
9	Research On Efficient Task Partition And Scheduling In MapReduce Data Processing System
10	The Research And Implementation Of Diversity Demand Oriented Parallel Computing Model