
An Application Case Study of MapReduce Parallel Computation and the Optimization of Its Runtime Framework

Posted on: 2013-01-27 | Degree: Master | Type: Thesis
Country: China | Candidate: X L Yang | Full Text: PDF
GTID: 2248330371987902 | Subject: Computer software and theory
Abstract/Summary:
In business, in science, and in everyday life, the volume of data being produced is growing at an astonishing rate. Traditional data storage and processing techniques and tools, represented by relational database systems, cannot capture, manage, and process data that is this large and accumulates this quickly. Big data contains more useful information, but it also raises more challenges, and big data processing techniques have become a focus of current research. Against this background, industry and academia have reached a consensus that parallel computing is the only way to deal with big data. However, parallel computing techniques are always closely bound to specific applications, and the diversity of those applications has so far prevented a common, unified parallel computational model or framework from emerging.

MapReduce, originally published by Google, has become the most successful technique for big data processing because of its high scalability and ease of use, and it has been widely adopted. Hadoop, the mainstream open-source implementation of MapReduce, is the de facto industrial standard for big data processing. However, the current implementation is geared mainly toward large-scale batch processing and neglects the fast response times that many real applications require, such as online data processing and queries. To address this problem, we examined the MapReduce execution framework in depth and optimized it. Our main contributions are the following two points:

(1) An application case study of the MapReduce parallel computational model. We took BLAST, the best-known sequence alignment tool in bioinformatics, as the subject. We analyzed its data partitioning and computation partitioning problems, proposed two methods for parallelizing it based on the MapReduce model, and ran extensive experiments to evaluate and compare the two methods.

(2) The optimization of the MapReduce execution framework. Based on an analysis of the time overhead of a MapReduce job and a dissection of the MapReduce execution framework, we propose two optimizations. The first moves the job setup and job cleanup work from the TaskTracker to the JobTracker, which reduces the time overhead of job preparation and cleanup. The second changes the task assignment scheme from pull to push and separates task state change event messages from the fixed-period heartbeat, so that they are transmitted immediately.

Finally, we used the BLAST application in our experiments to evaluate the performance of these optimizations, and the experimental results show that our methods are very effective.
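The intuition behind the second optimization can be illustrated with a toy model. The sketch below is not Hadoop code and does not reproduce the thesis's implementation; it only compares how long a ready task waits before assignment when workers ask for work on a fixed-period heartbeat (pull) versus when the master hands the task out immediately (push). The heartbeat interval, the function names, and the sample ready times are all assumptions chosen for illustration.

```python
import math

# Assumed heartbeat period in seconds (illustrative value, not taken from the thesis).
HEARTBEAT_INTERVAL = 3.0


def pull_assignment_delay(ready_time, heartbeat_interval=HEARTBEAT_INTERVAL):
    """Wait time for a task that becomes ready at `ready_time` when it can
    only be handed out at the next fixed-period heartbeat (pull model)."""
    next_heartbeat = math.ceil(ready_time / heartbeat_interval) * heartbeat_interval
    return next_heartbeat - ready_time


def push_assignment_delay(ready_time, network_latency=0.0):
    """Wait time when the master pushes the task to an idle worker as soon
    as it is ready (push model); only the assumed network latency remains."""
    return network_latency


def total_delay(ready_times, delay_fn):
    """Sum the assignment delay over all tasks under a given model."""
    return sum(delay_fn(t) for t in ready_times)


# Hypothetical times (seconds) at which four tasks become ready to run.
ready_times = [0.5, 1.0, 4.0, 7.5]

pull_total = total_delay(ready_times, pull_assignment_delay)
push_total = total_delay(ready_times, push_assignment_delay)

print(f"pull (heartbeat every {HEARTBEAT_INTERVAL}s): {pull_total:.1f}s total wait")
print(f"push (immediate):                 {push_total:.1f}s total wait")
```

In this simplified model every task under the pull scheme waits for part of a heartbeat period before it is assigned, while under the push scheme the wait shrinks to the message latency. For short, latency-sensitive jobs that per-task wait is a noticeable share of total run time, which is the kind of overhead the abstract's second optimization targets.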
Keywords/Search Tags:parallel computing, big data processing, sequence alignment, MapReduce optimization