
Research On Big Data Algorithm And Optimization Technology Of Metagenomics Analysis

Posted on: 2017-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: X Gu
GTID: 2370330569998829
Subject: Software engineering
Abstract/Summary:
Human production and daily life are inseparable from microorganisms in the environment, and the emergence of metagenomics makes it possible to study microbial species that cannot be grown in pure culture in the laboratory. However, bioinformatics analysis has many bottlenecks: sequence alignment and clustering account for more than 70% of the metagenomics analysis process. This thesis studies these two typical problems in depth from the perspectives of parallel optimization and algorithm improvement. The specific work is as follows:

1. SOAPaligner is a large-scale sequence alignment tool developed by BGI and widely used in its production pipeline. Faced with vast amounts of metagenomic sequence data and growing reference genomes, the software runs into insufficient memory and long running times. To address these problems, this thesis designs a new parallel optimization method and task partitioning strategy and tests them experimentally on the Hadoop Streaming platform. The results show that the Hadoop-based parallel program achieves a speedup of more than 10 times and scales well. For single-sample metagenomic sequence data (15 GB reference genome), sequence alignment time falls from about 6 hours to less than 20 minutes.

2. The MGS clustering method, the main computational method for the metagenome binning problem, has low time and space efficiency when applied to millions of genes: it needs too many iterations and cannot identify noise genes. This thesis divides the metagenomic clustering problem into two subproblems: gene similarity computation and graph-based metagenome binning. For the first subproblem, a Spark-based algorithm for computing similar gene pairs is proposed and accelerated with locality-sensitive hashing (LSH). The results show that using LSH for similar gene pair computation yields speedups of 5 to 14 times while preserving the accuracy of the results, and that the method scales well.

3. Graph-based gene binning is proposed. Compared with the earlier MGS clustering method and other clustering methods, it can filter noise genes out of the massive gene set, and it can be parallelized directly on the distributed graph processing framework Spark GraphX. This thesis presents a complete solution based on Spark GraphX. The process consists of graph generation, vertex degree analysis, graph connectivity analysis, and graph clustering. With Spark GraphX, these analyses complete within a few minutes.
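
To make the second and third contributions concrete, the following is a minimal single-machine sketch of the general idea: find similar gene pairs with LSH, build a graph from those pairs, drop low-degree vertices as noise, and take connected components as bins. It is not the thesis implementation, which runs on Spark and Spark GraphX. The abstract does not state the similarity measure or the LSH family, so this sketch assumes Jaccard similarity over k-mer sets and MinHash with banding; all function names, parameters, and thresholds here are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of overlapping k-mers of a sequence (assumed feature set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(token, seed):
    """Deterministic integer hash of a token under a given seed."""
    return int.from_bytes(hashlib.md5(f"{seed}:{token}".encode()).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min((_hash(t, seed) for t in tokens), default=0)
                 for seed in range(num_hashes))


def lsh_candidate_pairs(signatures, bands=8):
    """Split each signature into bands; genes sharing any band bucket are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(genes, threshold=0.5, k=4):
    """Verify only the LSH candidates exactly, instead of comparing all pairs."""
    sets = {name: kmers(seq, k) for name, seq in genes.items()}
    sigs = {name: minhash_signature(s) for name, s in sets.items()}
    return {(a, b) for a, b in lsh_candidate_pairs(sigs)
            if jaccard(sets[a], sets[b]) >= threshold}


def bin_genes(pairs, min_degree=1):
    """Drop vertices with degree below min_degree, then return connected components."""
    degree = defaultdict(int)
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    keep = {v for v, d in degree.items() if d >= min_degree}
    adj = defaultdict(set)
    for a, b in pairs:
        if a in keep and b in keep:
            adj[a].add(b)
            adj[b].add(a)
    seen, bins = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first traversal of one component
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        bins.append(comp)
    return bins


if __name__ == "__main__":
    genes = {"g1": "ACGTACGTTGCA", "g2": "ACGTACGTTGCA", "g3": "TTTTTTTTTTTT"}
    print(bin_genes(similar_pairs(genes)))
```

The speedup from LSH in a sketch like this comes from verifying only the candidate pairs that share a bucket rather than all pairs. In a distributed setting, the bucket keys would be natural grouping keys and the degree and connectivity steps would map onto graph-framework operations, but how the thesis actually does this is not described in the abstract.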
Keywords/Search Tags:Metagenomics, Sequence Alignment, Similar Gene Pair, Big Data, Hadoop Streaming, Spark, Locality-Sensitive Hashing, GraphX