
The Research And Implementation Of The Distributed Parallel Blast Algorithm That Is A Gene Sequence Alignment Algorithm Based On Hadoop Platform

Posted on: 2016-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: M Meng
Full Text: PDF
GTID: 2180330464963991
Subject: Software engineering
Abstract/Summary:
As an interdisciplinary field spanning life science and computer science, bioinformatics is one of the most important and cutting-edge disciplines in current scientific development, and its focus has shifted from accumulating sequence data to analyzing it. Sequence alignment is the most basic and most important research topic in biological sequence analysis. How to extract useful information from sequence data is a hot topic of current research, with considerable theoretical and practical value. Because sequence data are currently stored in sequence databases, this paper focuses on Blast, the most widely used sequence alignment algorithm for searching a sequence database for similar sequences.
Sequence data are now being generated at an explosive rate, and the time complexity of the Blast algorithm is closely related to the size of the gene database. To solve the problems caused by massive sequence data and to improve the processing efficiency of biological research, this paper selects Hadoop, an open source cloud computing platform, to run the Blast algorithm in a distributed, parallel manner. A Hadoop platform of sixteen nodes is built on the vSphere virtualization platform. The query sequence is preprocessed with an Aho-Corasick (A-C) automaton, and the sequence database is preprocessed on HDFS. Taking advantage of the compute-intensive nature and data independence of the Blast algorithm, this paper devises three MapReduce-based schemes for distributed parallel Blast. In addition, it tunes the static parameters of Hadoop according to the characteristics of the sequence data, obtaining the optimal block size and number of data replicas. Finally, it designs multiple groups of sequence alignment experiments and analyzes the results in detail.
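The abstract does not say how the A-C automaton is applied to the query sequence. As a rough illustration only, the sketch below assumes the query is cut into fixed-length exact words (k-mers), that these words are loaded into an Aho-Corasick automaton, and that the automaton then scans a database sequence once to report every word hit. The function names (`query_words`, `build_automaton`, `find_seeds`) and the word length are hypothetical and are not taken from the thesis.

```python
from collections import deque

def query_words(query, k):
    """Distinct k-mers of the query (hypothetical word definition)."""
    return sorted({query[i:i + k] for i in range(len(query) - k + 1)})

def build_automaton(words):
    """Build an Aho-Corasick automaton: trie, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # words recognised when a state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: failure links of shallow states are ready
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_seeds(subject, automaton):
    """Scan the subject once; return (start_position, word) for each hit."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(subject):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_automaton(query_words("ACGTAC", 3))
seeds = find_seeds("TTACGTT", automaton)
```

The point of the automaton is that all query words are matched in a single pass over each database sequence, instead of one pass per word.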
The experimental results show that the third scheme, in which matching and half of the extension are performed in the map function and the other half of the extension in the reduce function, is the best overall. In addition, the optimal block size and number of data replicas are 128 MB and 5, respectively. In short, the proposed optimization is valid, and the improvement in actual performance is significant.
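The abstract does not define what "half" of the extension (called expansion in the original wording) means. One possible reading, used here purely as an assumption, is that the map function finds seeds and extends them in one direction, and the reduce function extends them in the other. The sketch below simulates that split in plain Python without Hadoop. The scoring values, the drop-off threshold, the seed lookup by dictionary, and all function names are hypothetical. A real job would also need to make the subject sequence available to the reducer, which this sketch sidesteps by keeping the database in memory.

```python
from collections import defaultdict

MATCH, MISMATCH, X_DROP = 1, -1, 2  # hypothetical scoring parameters

def extend(query, subject, q, s, step):
    """Ungapped extension from (q, s) in direction step (+1 or -1).
    Returns (best_score, length_at_best_score)."""
    score = best = length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += MATCH if query[q] == subject[s] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > X_DROP:
            break  # score fell too far below the best seen so far
        q += step
        s += step
    return best, best_len

def map_fn(subject_id, subject, query, k):
    """Seed matching plus leftward extension; emits partial hits."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], []):
            left_score, left_len = extend(query, subject, q - 1, s - 1, -1)
            yield subject_id, (q, s, k * MATCH + left_score, left_len)

def reduce_fn(subject_id, partial_hits, subject, query, k):
    """Rightward extension; keeps the best hit for this subject as
    (score, query_start, subject_start, length)."""
    best = None
    for q, s, score, left_len in partial_hits:
        right_score, right_len = extend(query, subject, q + k, s + k, +1)
        hit = (score + right_score, q - left_len, s - left_len,
               left_len + k + right_len)
        if best is None or hit > best:
            best = hit
    return subject_id, best

def run_job(query, database, k=3):
    """In-memory stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)
    for subject_id, subject in database.items():
        for key, value in map_fn(subject_id, subject, query, k):
            grouped[key].append(value)
    return dict(reduce_fn(sid, hits, database[sid], query, k)
                for sid, hits in grouped.items())

result = run_job("ACGTACGT", {"s1": "TTACGTACGTTT", "s2": "GGGGGGGG"})
```

Splitting the extension this way moves part of the compute-heavy work out of the map tasks, which is consistent with the abstract's description of the third scheme, though the actual division of work in the thesis may differ.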
Keywords/Search Tags:Sequence alignment, Blast, HDFS, MapReduce, Static optimal adjustment