
Research And Implementation On The Human Whole-genome Sequencing Data Processing Technology Based On Hadoop

Posted on: 2016-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: J J Lin
GTID: 2180330461483283
Subject: Computer technology
Abstract/Summary:
With the development of next-generation sequencing (NGS) technologies, it is now possible to sequence a person's whole genome in a very short period of time. This heralds the arrival of an era of personalized medicine based on genetic information. However, an NGS platform can produce billions of DNA sequence reads in a single run, equivalent to hundreds of billions of bytes of data. At the same time, the volume of sequencing data is growing, and the cost of sequencing is falling, at rates that far exceed Moore's law. Storing and analyzing these data therefore poses a great challenge. Sequencing data come in many forms and from many sources, so preprocessing, managing, and analyzing them is beyond the reach of many bioinformatics researchers. Large-scale data analysis and efficient mining require a high degree of information integration and a well-matched combination of tools.
At present, the most widely used tools for analyzing sequencing data are the Burrows-Wheeler Aligner (BWA) and the Genome Analysis Toolkit (GATK). BWA is a sequence alignment tool known for its high accuracy and speed, and GATK is a variant calling toolkit that is widely used because of its high accuracy. Hadoop is currently the most popular solution for processing big data. For "big data" such as personal whole-genome sequencing data, the Hadoop distributed framework, which is open source and easy to use, is the best choice. This thesis therefore studies and implements a Hadoop-based technology for processing human whole-genome sequencing data, and integrates the BWA and GATK tools into the Hadoop framework. Genomic data analysis involves data in various formats, all of which are stored in HDFS. In this thesis, data storage and reading are implemented through a unified data access layer.
This thesis abstracts the individual steps of the BWA and GATK business logic into the MapReduce parallel computing model and extends the tools through secondary development, so that genome data can be analyzed in parallel on a Hadoop cluster. The resulting Hadoop-based system integrates a variety of analysis tools into a complete sequencing data processing pipeline. Finally, practical tests and comparative analysis show that the speed of data analysis is greatly increased while the accuracy of the analysis results is preserved. At the same time, because Hadoop is simple to operate and convenient to use, the difficulty of processing sequencing data is greatly reduced.
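To make the MapReduce abstraction described above more concrete, the following is a minimal, in-memory sketch in Python. It is not the thesis's implementation and does not call BWA, GATK, or Hadoop. It assumes reads are already aligned and represented as hypothetical `(chrom, start, sequence)` tuples, uses a hypothetical `BIN_SIZE` to key observations by genomic region, and replaces real variant calling with a toy majority-base comparison against a reference.

```python
from collections import Counter, defaultdict

BIN_SIZE = 1000  # hypothetical region width used as the shuffle key


def map_read(read):
    """Map step: emit (region_key, observation) pairs for one aligned read.

    `read` is a hypothetical (chrom, start, sequence) tuple standing in for
    an alignment record; a real pipeline would obtain it from an aligner.
    """
    chrom, start, seq = read
    for offset, base in enumerate(seq):
        pos = start + offset
        yield (chrom, pos // BIN_SIZE), (pos, base)


def shuffle(pairs):
    """Group mapper output by key, as a framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(key, observations, reference):
    """Reduce step: toy variant call for one region.

    Reports positions whose most frequent observed base differs from the
    reference base. This is an illustrative stand-in, not GATK's method.
    """
    chrom, _ = key
    pileup = defaultdict(Counter)
    for pos, base in observations:
        pileup[pos][base] += 1
    variants = []
    for pos in sorted(pileup):
        base, _count = pileup[pos].most_common(1)[0]
        ref_base = reference[chrom][pos]
        if base != ref_base:
            variants.append((chrom, pos, ref_base, base))
    return variants


def run(reads, reference):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = (pair for read in reads for pair in map_read(read))
    results = []
    for key, obs in sorted(shuffle(pairs).items()):
        results.extend(reduce_region(key, obs, reference))
    return results
```

Keying by region means each reducer sees all observations for its stretch of the genome, so regions can be processed independently. On a real cluster the framework, rather than the `shuffle` helper here, would perform the grouping and distribute the work.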
Keywords/Search Tags:Hadoop, whole genome, call variants, NGS