Research And Implementation On The Human Whole-genome Sequencing Data Processing Technology Based On Hadoop

Posted on:2016-07-03

Degree:Master

Type:Thesis

Country:China

Candidate:J J Lin

GTID:2180330461483283

Subject:Computer technology

Abstract/Summary:

With the development of next-generation sequencing (NGS) technologies, a person’s whole genome can now be sequenced in a very short time, which heralds the arrival of personalized medicine based on genetic information. However, an NGS platform can produce billions of DNA sequence reads in a single run, equivalent to hundreds of billions of bytes of data. At the same time, the volume of sequencing data is growing, and the cost of sequencing is falling, at rates that far exceed Moore’s law. Storing and analyzing these data is therefore a major challenge. Sequencing data come in many forms and from many sources, so their preprocessing, management, and analysis are beyond the reach of many bioinformatics researchers. Large-scale analysis and efficient mining of these data require a high degree of information integration and the seamless combination of various tools.
At present, the most widely used tools for sequencing data analysis are the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK). BWA is a sequence alignment tool known for its accuracy and speed, and GATK is a variant-calling toolkit that is widely used because of its high accuracy. Hadoop is currently the most popular solution for processing big data. For “big data” such as personal whole-genome sequencing, the Hadoop distributed framework, which is open source and easy to use, is the best choice. Therefore, this thesis studies and implements Hadoop-based processing of human whole-genome sequencing data by integrating the BWA and GATK tools into the Hadoop framework. Genomic data analysis involves data in various formats, all of which are stored in HDFS. In this thesis, data storage and reading are implemented through a unified data access layer.
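As a rough illustration of what a unified data access layer could look like, the following Python sketch dispatches on file extension to a format-specific reader and returns records in one common shape. It is not the thesis's implementation: the names `read_fastq`, `read_sam`, `READERS`, and `open_records` are hypothetical, only FASTQ and SAM are covered, and the readers take an already opened text stream, so actual HDFS access is deliberately left out.

```python
import io
from typing import Callable, Dict, Iterator, TextIO

# Hypothetical common record shape: one plain dict per record.
Record = Dict[str, str]


def read_fastq(stream: TextIO) -> Iterator[Record]:
    """Yield one record per four-line FASTQ entry."""
    while True:
        header = stream.readline().rstrip("\n")
        if not header:
            return  # end of stream
        seq = stream.readline().rstrip("\n")
        stream.readline()  # '+' separator line, ignored
        qual = stream.readline().rstrip("\n")
        yield {"name": header[1:], "seq": seq, "qual": qual}


def read_sam(stream: TextIO) -> Iterator[Record]:
    """Yield one record per SAM alignment line, skipping '@' header lines."""
    for line in stream:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 0 = QNAME, 2 = RNAME, 3 = POS, 9 = SEQ
        yield {"name": fields[0], "chrom": fields[2],
               "pos": fields[3], "seq": fields[9]}


# Assumed registry: file extension -> reader function.
READERS: Dict[str, Callable[[TextIO], Iterator[Record]]] = {
    ".fastq": read_fastq,
    ".sam": read_sam,
}


def open_records(path: str, stream: TextIO) -> Iterator[Record]:
    """Single entry point: choose a reader from the file extension."""
    for ext, reader in READERS.items():
        if path.endswith(ext):
            return reader(stream)
    raise ValueError(f"unsupported format: {path}")


if __name__ == "__main__":
    fastq = io.StringIO("@r1\nACGT\n+\nIIII\n")
    for record in open_records("sample.fastq", fastq):
        print(record)
```

Callers only ever see `open_records`, so supporting another format under this assumed design means adding one reader function and one registry entry.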
This thesis abstracts the steps of the BWA and GATK business logic into the MapReduce parallel computing model and carries out secondary development on both tools to enable parallel analysis of genomic data on a Hadoop cluster. The Hadoop-based genomic data processing system integrates a variety of analysis tools into a complete sequencing data processing pipeline. Finally, practical tests and comparative analysis show that the speed of data analysis is greatly increased while the accuracy of the analysis results is preserved. At the same time, Hadoop’s simple operation and ease of use greatly reduce the difficulty of processing sequencing data.
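The abstract does not say how the BWA and GATK steps are mapped onto MapReduce. One common pattern, sketched below in plain Python under that caveat, is to key each aligned read by a genomic region in the map phase, group by key in the shuffle, and process each region independently in the reduce phase. Everything here is an assumption for illustration: the 1 Mb `BIN_SIZE`, the `(chrom, pos, name)` tuple layout, and the function names are hypothetical, the reducer only counts reads where a real pipeline would call variants, and no Hadoop, BWA, or GATK code is invoked.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

BIN_SIZE = 1_000_000  # assumed region width in base pairs

Alignment = Tuple[str, int, str]  # (chromosome, 1-based position, read name)
Key = Tuple[str, int]             # (chromosome, region index)


def map_alignment(aln: Alignment) -> Iterator[Tuple[Key, Alignment]]:
    """Map phase: emit the read keyed by the region containing its start."""
    chrom, pos, _name = aln
    yield (chrom, (pos - 1) // BIN_SIZE), aln


def shuffle(pairs: Iterable[Tuple[Key, Alignment]]) -> Dict[Key, List[Alignment]]:
    """Shuffle phase: group mapped values by key, as the framework would."""
    groups: Dict[Key, List[Alignment]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_region(key: Key, alns: List[Alignment]) -> Tuple[Key, int]:
    """Reduce phase: stand-in for per-region analysis; here, a read count."""
    return key, len(alns)


def run_job(alignments: Iterable[Alignment]) -> Dict[Key, int]:
    """Run map, shuffle, and reduce in-process over the given alignments."""
    pairs = (kv for aln in alignments for kv in map_alignment(aln))
    return dict(reduce_region(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    reads = [("chr1", 1, "a"), ("chr1", 999_999, "b"),
             ("chr1", 1_000_001, "c"), ("chr2", 5, "d")]
    print(run_job(reads))
    # {('chr1', 0): 2, ('chr1', 1): 1, ('chr2', 0): 1}
```

Because each region is reduced independently, the regions can be spread across cluster nodes, which is the property a Hadoop-based pipeline relies on for its speedup.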

Keywords/Search Tags:

Hadoop, whole genome, call variants, NGS

Related items

1	Combining Variants Data For Genome Indexing
2	Research On Optimization Of Call Centers With A Queuing Information And Call-Back Option
3	Characterization And Optimization Of Monoclonal Antibody Basic Charge Variants In CHO Cell Culture
4	A computational genomics study: Characterizing genomic variants in non-coding regions of the human genome
5	Genetic Association Analysis Of Rare Variants
6	The Research Of Rare Variants Based On The Genome-wide Association Study
7	Investigation Of The Effects Of Missing Call Bias And Estimation Of CNV Mutation Rate In Human Genome Analysis
8	Parallel Optimization And Implementation Of Massive Genome Annotation Algorithms
9	Genome-wide Patterns Of Large-size Presence/Absence Variants And Their Associations With Agronomic Traits In Sorghum (Sorghum Bicolour)
10	The Call Characteristics And The Mating Call Of The Crested Ibis(Nipponia Nippon)