Study On Parallel Processing And Analysis Methods Of Gene Information In Omics Big Data Context

Posted on: 2018-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Huang
GTID: 2310330512486727
Subject: Control Science and Engineering
Abstract/Summary:
With the development and maturation of next-generation sequencing technology, high-throughput sequencing has become a routine tool in biomedical research and will be widely used in agriculture and medical care, including emerging fields such as precision medicine and molecular breeding. However, unlike low-throughput technology, high-throughput sequencing can produce many kinds of omics data (whole-genome, whole-exome, transcriptome, and metagenome). These data are characterized by high throughput, large volume, and complex heterogeneity. The processing and analysis steps involved are complicated and place high demands on data-processing software and hardware. How to process and analyze high-throughput sequencing data quickly, efficiently, and accurately is both a difficulty and a bottleneck for the widespread application of the technology. For example, precision medicine, which currently attracts wide attention, depends mainly on gene sequencing; efficiently processing and analyzing a patient's massive gene sequencing data, and extracting personalized cancer driver information from it, is the key and difficult problem in achieving accurate tumor diagnosis and treatment. From first-generation sequencing to the latest third-generation sequencing, sequencing cost has fallen significantly while output throughput has grown explosively. The throughput of first-generation sequencing is only 0.2 MB/run; second-generation sequencing, represented by Illumina, can reach about 1500 GB/run; and third-generation sequencing reaches 30-400 bp/s. This progress has provided strong support for related research, but how to handle the resulting massive sequencing data has become an urgent problem.

To address this problem, this thesis designs a parallel, automated software system for processing high-throughput sequencing data, based on Hadoop technology. Its main purpose is to provide a stable, efficient, and inexpensive automated processing tool for the analysis of massive sequencing data. The core idea of the tool is to use the MapReduce parallel computing framework to segment, align, and query the relevant sequencing data, and finally to output mutated-gene information files or transcript files. The tool has the following advantages: (1) It is compatible with sequencing data generated by the Illumina and Roche 454 sequencing platforms. (2) It can process DNA-seq data and can also analyze RNA-seq data. (3) To adapt to different hardware environments, it provides two processing modes, a low-performance mode and a high-performance mode, so that it can run on hardware of different levels.
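
The segment, compare, and aggregate idea described above can be illustrated with a small, single-process sketch. It is not the thesis's implementation and does not use Hadoop: the toy reference string, the pre-positioned reads, and the function names (`split_reads`, `map_chunk`, `reduce_counts`, `call_variants`) are all assumptions made for illustration. A real pipeline would run the map step on many nodes and use a proper aligner rather than a base-by-base comparison at a known offset.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical toy reference; real references are whole genomes.
REFERENCE = "ACGTACGTAC"


def split_reads(reads, chunk_size):
    """Segment step: yield the reads in fixed-size chunks (one chunk per map task)."""
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk, reference=REFERENCE):
    """Map step: compare each (offset, sequence) read with the reference and
    emit ((position, ref_base, alt_base), 1) for every mismatching base."""
    for offset, sequence in chunk:
        for i, base in enumerate(sequence):
            position = offset + i
            if position < len(reference) and base != reference[position]:
                yield (position, reference[position], base), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def call_variants(reads, chunk_size=2, min_support=2):
    """Run segment -> map -> reduce and keep mismatches seen in at least
    min_support reads, which filters out isolated sequencing errors."""
    pairs = []
    for chunk in split_reads(reads, chunk_size):
        pairs.extend(map_chunk(chunk))
    counts = reduce_counts(pairs)
    return sorted(key for key, n in counts.items() if n >= min_support)


if __name__ == "__main__":
    # Three reads support A->T at position 4; one read has a lone mismatch at 9.
    sample_reads = [(0, "ACGT"), (2, "GTTC"), (3, "TTCG"), (4, "TCGT"), (7, "TAG")]
    print(call_variants(sample_reads))  # [(4, 'A', 'T')]
```

Because each chunk is mapped independently and the reduce step only sums counts per key, the same structure carries over to a distributed framework, where chunks become input splits and the per-key summation runs in reducers.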
Keywords/Search Tags: next-generation sequencing technology, high-throughput sequencing data analysis, cloud computing, Hadoop technology, precision medicine