Research On Big Data Algorithm And Optimization Technology Of Metagenomics Analysis

Posted on:2017-09-13

Degree:Master

Type:Thesis

Country:China

Candidate:X Gu

GTID:2370330569998829

Subject:Software engineering

Abstract/Summary:

Human production and daily life are inseparable from the microorganisms in the environment, and the emergence of metagenomics makes it possible to study microbial species that cannot be grown in pure culture in the laboratory. However, bioinformatics analysis has many bottlenecks, and more than 70% of metagenomic sequences go through alignment and clustering during metagenomic analysis. This thesis studies these two typical problems in depth from the perspectives of parallel optimization and algorithm improvement. The specific work is as follows:

1. SOAPaligner is a large-scale sequence alignment program developed by BGI and widely used in its production pipeline. Faced with vast amounts of metagenomic sequence data and ever-growing reference genomes, it runs into insufficient memory and long running times. To address these problems, this thesis designs a new parallel optimization method and task partitioning strategy and tests them experimentally on the Hadoop Streaming platform. The results show that the Hadoop-based parallel program achieves a speedup of more than 10 times and scales well. For single-sample metagenomic sequence data (reference genome 15 GB), the sequence alignment time falls from about 6 hours to less than 20 minutes.

2. MGS clustering, the main computational method for the metagenome binning problem, needs too many iterations and cannot identify noise genes when faced with millions of genes, so its time and space efficiency is low. This thesis divides the metagenomic clustering problem into two subproblems: gene similarity computation and graph-based metagenome binning. For the similarity computation, a Spark-based algorithm for finding similar gene pairs is proposed and accelerated with locality-sensitive hashing (LSH). The results show that, with LSH, the similar gene pair computation achieves speedups of 5 to 14 times while preserving the accuracy of the results, and the method scales well.

3. A graph-based gene binning method is proposed. Compared with the earlier MGS clustering method and other clustering methods, it can filter out noise genes from the massive gene set, and it can use the distributed parallel graph processing framework Spark GraphX directly to parallelize binning. This thesis presents a complete solution based on the Spark GraphX framework. The whole process is divided into the following steps: graph generation, vertex degree analysis, graph connectivity analysis, and graph clustering. With Spark GraphX, these analyses can be completed in a few minutes.
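
The two subproblems described above (similar gene pair computation accelerated by LSH, followed by graph-based binning) can be illustrated with a small single-machine sketch. It is not the thesis implementation: the abstract does not state which similarity measure or LSH family is used, so this example assumes MinHash over k-mer sets with Jaccard similarity, and plain Python stands in for Spark and GraphX. All function names and parameters (`kmers`, `similar_pairs`, `bin_genes`, `threshold`, `min_degree`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of overlapping k-mers of a sequence (assumed feature set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(features, num_hashes=32):
    """MinHash signature: for each seeded hash, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{f}".encode(), digest_size=8).digest(),
                "big",
            )
            for f in features
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=8):
    """Split signatures into bands; genes sharing any band bucket are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for gene, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(gene)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def similar_pairs(genes, threshold=0.5, k=4):
    """Verify only the LSH candidates exactly, instead of all O(n^2) pairs."""
    feats = {g: kmers(s, k) for g, s in genes.items()}
    sigs = {g: minhash_signature(f) for g, f in feats.items()}
    return {
        (a, b)
        for a, b in lsh_candidate_pairs(sigs)
        if jaccard(feats[a], feats[b]) >= threshold
    }


def bin_genes(gene_ids, edges, min_degree=1):
    """Graph generation, degree filter for noise genes, connected components."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Vertex degree analysis: genes with too few similar partners count as noise.
    kept = {g for g in gene_ids if len(adjacency[g]) >= min_degree}
    bins, seen = [], set()
    # Connectivity analysis: each connected component becomes one bin.
    for start in sorted(kept):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            vertex = stack.pop()
            component.add(vertex)
            for neighbor in adjacency[vertex]:
                if neighbor in kept and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        bins.append(component)
    return bins
```

Given two identical sequences `g1` and `g2` and an unrelated sequence `g3`, `similar_pairs` returns the single pair `("g1", "g2")`, and `bin_genes` places `g1` and `g2` in one bin while dropping `g3` as a noise gene. In a distributed setting the bucketing step would map naturally onto a group-by-key operation and the component search onto a graph framework's connected-components routine, but that mapping is an assumption about how such a sketch might scale, not a description of the thesis code.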

Keywords/Search Tags:

Metagenomics, Sequence Alignment, Similar Gene Pair, Big Data, Hadoop Streaming, Spark, Locality-Sensitive Hashing, GraphX

Related items

1	Alignment-free Sequence Similarity Analysis And Clustering Algorithms On Biological Sequences
2	Research And Implementation Of Sequence Alignment Algorithm For Gene Large Data Based On Hadoop
3	The Parallelization Research Of Genomics Data Comparison Algorithm And The Construction Of Comparison Platform Based On Spark
4	Research On The Third-generation DNA Sequencing Data Compression Method
5	The Research And Implementation Of The Distributed Parallel Blast Algorithm That Is A Gene Sequence Alignment Algorithm Based On Hadoop Platform
6	Gene Sequence Alignment Algorithm Research And Implement In SNP
7	Research On Physical Marine Big Data Cloud Computing Technology Based On Spark
8	A Method Of Sanitizing A Privacy-sensitive Mobility Knowledge Network Of Trajectory Data Based On A Spark Platform
9	Ultra-large Multiple Sequence Alignment Based On Distributed Computing
10	Study Of Several Algorithms For Alignment Problem Of Sequence And Sequence Secondary Structure