The Research And Implementation Of The Distributed Parallel Blast Algorithm That Is A Gene Sequence Alignment Algorithm Based On Hadoop Platform

Posted on:2016-12-16

Degree:Master

Type:Thesis

Country:China

Candidate:M Meng

Full Text:PDF

GTID:2180330464963991

Subject:Software engineering

Abstract/Summary:


As an interdisciplinary field spanning life science and computer science, bioinformatics is one of the most important and cutting-edge disciplines in current scientific development, and its focus has shifted from accumulating sequence data to analyzing it. Sequence alignment is the most basic and most important research topic in biological sequence analysis. How to extract useful information from sequence data is a hot topic of current research, with great theoretical and practical value. Because sequence data are stored in sequence databases, this paper focuses on Blast, the most widely used sequence alignment algorithm for searching a sequence database for similar sequences. Sequence data are now growing explosively, and the time complexity of the Blast algorithm is closely related to the size of the gene database. To solve the problems caused by massive sequence data and to improve the processing efficiency of biological research, this paper uses Hadoop, an open-source cloud computing platform, to make the Blast algorithm distributed and parallel. A Hadoop cluster of sixteen nodes is built on the vSphere virtualization platform. The paper implements preprocessing of the query sequence based on the A-C automaton and of the sequence database based on HDFS. Drawing on the computation-intensive and data-independent characteristics of the Blast algorithm, it devises three schemes for distributing and parallelizing Blast with MapReduce. In addition, it tunes the static parameters of Hadoop according to the characteristics of the sequence data and obtains the optimal block size and number of data replicas. Finally, it designs several groups of sequence alignment experiments and analyzes the results in detail.
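The query preprocessing step can be illustrated with a small sketch. It assumes that "A-C automaton" refers to the Aho-Corasick multi-pattern matcher and that the query is cut into fixed-length words that are matched exactly against a database sequence to produce seed hits. The function names, the word length `k`, and the exact-match simplification are illustrative assumptions and are not taken from the thesis. In a MapReduce setting, a routine of this kind could plausibly run inside the map function on each database block.

```python
from collections import deque


def query_words(query, k):
    """Map each length-k word of the query to its start offsets (hypothetical helper)."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index


def build_automaton(words):
    """Build an Aho-Corasick automaton: goto transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_seeds(query, subject, k):
    """Return (query_offset, subject_offset) pairs for every exact word hit."""
    index = query_words(query, k)
    goto, fail, out = build_automaton(index)
    seeds, s = [], 0
    for j, ch in enumerate(subject):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            for i in index[w]:
                seeds.append((i, j - k + 1))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TTACGTT", 3))
```

The database sequence is scanned once regardless of how many query words there are, which is the usual reason for choosing an automaton over repeated substring searches. Extending each seed into a scored alignment is left out of this sketch.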
The experimental results show that the third scheme, which performs the matching and half of the extension in the map function and the other half of the extension in the reduce function, is the best overall. In addition, the optimal block size and replica count of the data blocks are 128 MB and 5, respectively. In short, the proposed optimization is valid, and the improvement in actual performance is significant.
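To give a feel for what the reported settings mean in practice, the following sketch computes how a sequence database of a given size would be laid out with a 128 MB block size and 5 replicas. In Hadoop 2.x and later these two settings correspond to the `dfs.blocksize` and `dfs.replication` properties. The helper name and the 1 GiB example size are assumptions for illustration only; the thesis abstract does not state the database sizes used.

```python
MB = 1024 * 1024


def hdfs_layout(db_bytes, block_bytes=128 * MB, replication=5):
    """Estimate block count and raw storage for one file (hypothetical helper)."""
    blocks = -(-db_bytes // block_bytes)  # integer ceiling division
    return {
        "blocks": blocks,                       # logical blocks of the file
        "block_replicas": blocks * replication, # physical copies across the cluster
        "raw_bytes": db_bytes * replication,    # total disk space consumed
    }


if __name__ == "__main__":
    print(hdfs_layout(1024 * MB))
```

With the default input format the number of map tasks typically follows the number of blocks, so the block size trades per-task overhead against the degree of parallelism, while a higher replica count costs disk space but gives the scheduler more nodes on which a block can be read locally.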

Keywords/Search Tags:

Sequence alignment, Blast, HDFS, MapReduce, Static optimal adjustment


Related items

1	Biological Sequences Alignment
2	Bioinformatics Platform To Build And Sequence Alignment Algorithm Study
3	Research On Distributed Spatial Connection Query Based On MapReduce
4	Research On Distributed Storage And Sequence Alignment Of DNA Data Based On HBase
5	Study Of Several Algorithms For Alignment Problem Of Sequence And Sequence Secondary Structure
6	Structure Of Local Sequence Database And Research Of Loop Visualization Of Mammals' DNA
7	The Parallelization Research Of Genomics Data Comparison Algorithm And The Construction Of Comparison Platform Based On Spark
8	Application And Study Of Optimal Methods In Bio-Sequences Alignment
9	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm
10	Development And Assembly For Several DP Based Sequence Alignment Algorithm Components