Study On Parallel Processing And Analysis Methods Of Gene Information In Omics Big Data Context

Posted on:2018-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z Z Huang

Full Text:PDF

GTID:2310330512486727

Subject:Control Science and Engineering

Abstract/Summary:

With the development and maturation of next-generation sequencing technology, high-throughput sequencing has become a routine tool in biomedical research and will be widely used in agriculture and medical care, in emerging fields such as precision medicine and molecular breeding. Unlike low-throughput technology, however, high-throughput sequencing produces many kinds of omics data (whole genome, whole exome, transcriptome, and metagenome). These data are characterized by high throughput, large volume, and complex heterogeneity. The processing and analysis steps involved are complicated and place high demands on data processing software and hardware. Processing and analyzing high-throughput sequencing data quickly, efficiently, and accurately is a difficulty and a bottleneck for the widespread application of the technology. For example, precision medicine, which currently attracts wide attention, depends mainly on gene sequencing. Efficiently processing and analyzing a patient's massive gene sequencing data, and extracting personalized cancer driver information from it, is the key and difficult problem in achieving accurate tumor diagnosis and treatment.

From first-generation sequencing to the latest third-generation sequencing, the cost of sequencing has fallen significantly while output throughput has grown explosively. First-generation sequencing yields only 0.2 MB/run; second-generation sequencing, represented by Illumina, can reach about 1500 GB/run; and third-generation sequencing reaches 30-400 bp/s. Progress in sequencing technology has provided strong support for related research, but handling the resulting mass of sequencing data has become an urgent problem.

To address this problem, this thesis designs a parallel, automated software system for processing high-throughput sequencing data, based on Hadoop technology. Its main purpose is to provide a stable, efficient, and inexpensive automated tool for analyzing massive sequencing data. The core idea of the tool is to use the MapReduce parallel computing framework to segment, compare (align), and query the sequencing data, and finally to output files of mutated gene information or transcript files. The tool has the following advantages: (1) It is compatible with sequencing data generated by the Illumina and Roche 454 sequencing platforms. (2) It can process DNA-seq data and can also analyze RNA-seq data. (3) To suit different hardware environments, it provides two processing modes, a low-performance mode and a high-performance mode, so that the tool can adapt to hardware of different levels.
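The segment, compare, and query flow described in the abstract can be illustrated with a small single-process sketch. It is not the thesis's implementation and does not use Hadoop: the function names (`split_reads`, `map_chunk`, `reduce_pileup`, `call_variants`, `run`), the toy reference sequence, and the naive ungapped matching used in place of a real aligner are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Toy reference sequence (hypothetical; real pipelines use a genome assembly).
REFERENCE = "ACGTACGTTAGC"


def split_reads(reads, chunk_size):
    """Segment step: cut the read list into chunks, one per map task."""
    for start in range(0, len(reads), chunk_size):
        yield reads[start:start + chunk_size]


def map_chunk(chunk, reference, max_mismatches=1):
    """Map step: place each read at its best ungapped position and
    emit (reference_position, observed_base) pairs.

    This brute-force scan stands in for a real aligner (assumption)."""
    for read in chunk:
        best_pos, best_mm = None, max_mismatches + 1
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches < best_mm:
                best_pos, best_mm = pos, mismatches
        if best_pos is not None:  # reads with too many mismatches are dropped
            for offset, base in enumerate(read):
                yield best_pos + offset, base


def reduce_pileup(pairs):
    """Reduce step: group observed bases by reference position."""
    counts = defaultdict(Counter)
    for pos, base in pairs:
        counts[pos][base] += 1
    return counts


def call_variants(counts, reference, min_depth=2):
    """Query step: report positions whose most frequent base differs
    from the reference and is seen at least min_depth times."""
    variants = []
    for pos in sorted(counts):
        base, depth = counts[pos].most_common(1)[0]
        if depth >= min_depth and base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


def run(reads, reference=REFERENCE, chunk_size=2):
    """Run the map tasks sequentially, then reduce and query."""
    pairs = []
    for chunk in split_reads(reads, chunk_size):
        pairs.extend(map_chunk(chunk, reference))
    return call_variants(reduce_pileup(pairs), reference)


if __name__ == "__main__":
    # Two reads carry A instead of G at reference position 6.
    print(run(["TACATT", "ACATTA", "CGTACG"]))
```

In a Hadoop deployment each chunk would be handled by a separate map task and the grouping by position would be done by the framework's shuffle phase. Here the chunks are simply processed in a loop so that the data flow is easy to follow.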

Keywords/Search Tags:

Next generation sequencing technology, high-throughput sequencing data analysis, cloud computing, Hadoop technology, Precision medicine

Related items

1	High-throughput genomic assays: Applications and analysis of DSL technology and next-generation sequencing
2	Research On Genome Missembly Identification Method Based On High-throughput Sequencing Data
3	Analyses Of Genome Duplication And Relationships And Alternative Splicing Using High Throughput Sequencing Technology
4	Statistical Model On Next Generation Sequencing
5	Development Of High-throughput Genotyping Methods Based On DNA Microarray And New-generation Sequencing Technologies
6	Screening And Identification Of Anti-FSH Nanobodies Using Phage Display And High-throughput Sequencing
7	Integrated Prokaryotic Analysis Pipeline Based On Next-generation Sequencing Technology
8	Research On Optimizing Processing Algorithms For Next-generation Sequencing Data
9	Accurate Detection Of Low-frequency Mutations Based On DNBSEQ High-throughput Sequencing Technology
10	Research On Detecting Methods Of Indels In Next-generation Sequencing Data Of Human Genome And Establishment Of Detecting Platform For Indels In Genome