Parallel Implementation For Last Based On Hadoop Streaming

Posted on: 2015-06-04  Degree: Master  Type: Thesis
Country: China  Candidate: W H Li  Full Text: PDF
GTID: 2298330434951166  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Internet of Things (IoT), and cloud computing, the volume of scientific data is growing rapidly. Analyzing this data with the mature software of each field has become a major problem: in many scientific fields, the mature software is still stand-alone software that cannot keep pace with growing data volumes. Parallelizing such stand-alone software on cloud platforms can effectively solve this problem.

Parallelizing a stand-alone application usually requires analyzing its source code and converting its input file structure, which costs much time and leads to a long development period. This thesis describes a model with which stand-alone applications can be deployed in parallel on the Hadoop platform quickly, without changing any source code or the input file structure. Using Hadoop Streaming, the programming tool provided by Hadoop, the model parallelizes Last, a sequence alignment application, on Hadoop, which provides a reference for similar problems.

Main contents: First, the thesis analyzes the technologies and principles involved in parallelizing Last, focusing on the Last alignment principle, the Hadoop distributed platform, and the Lustre cluster file system. Second, it designs a parallelization model based on HDFS, which modifies InputFormat to provide input data that meets the constraints. A Mapper script wraps the Last software so that it runs transparently on the Hadoop platform. Third, it designs a parallelization model based on the Lustre file system, in which an indexing algorithm builds an index over the input data so that each sub-task can quickly obtain the data slice required by the main task. This model controls the parallelization granularity through Mapper and Reducer scripts that wrap the Last software and through a rewritten partitioning class, Partitioner. Finally, experiments are designed to verify the feasibility, validity, and accuracy of the parallelization.
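
The wrapping idea described in the abstract can be illustrated with a minimal Python sketch of a Hadoop Streaming mapper. This is not the thesis's actual script. It assumes that each input split already contains whole sequence records (the abstract attributes this to the modified InputFormat), that the wrapped tool takes its input file as the last command-line argument, and that its output goes to standard output. The function names `run_wrapped_tool` and `main` are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: run an unmodified command-line tool inside a streaming mapper."""
import os
import subprocess
import sys
import tempfile


def run_wrapped_tool(records, command):
    """Write the input lines to a temporary file, run `command` on that
    file, and yield the tool's standard output line by line.

    records: iterable of text lines (for a mapper, sys.stdin).
    command: the tool and its arguments as a list; the temporary file
             path is appended as the final argument (an assumption
             about the tool's command-line interface).
    """
    # The stand-alone tool expects a file, so materialize the split first.
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
        tmp.writelines(records)
        path = tmp.name
    try:
        # check=True makes the mapper fail visibly if the tool fails.
        result = subprocess.run(
            command + [path], capture_output=True, text=True, check=True
        )
        for line in result.stdout.splitlines():
            yield line
    finally:
        os.remove(path)


def main(command):
    # Streaming mappers read their split on stdin and emit results on stdout.
    for line in run_wrapped_tool(sys.stdin, command):
        print(line)


# The tool command is taken from the script's own arguments, so the
# wrapper itself contains no tool-specific code.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Under these assumptions the script would be supplied as the mapper command of a streaming job, with the aligner invocation passed as its arguments. The wrapped program is treated as a black box, which is what allows its source code and input file format to stay unchanged.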
Keywords/Search Tags: Hadoop, Parallelization, Last, Cloud Computing, InputFormat