Research On NGS Data Processing Algorithm Based On Hadoop Platform

Posted on:2020-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y R Fang

Full Text:PDF

GTID:2428330599450901

Subject:Engineering

Abstract/Summary:

Next-generation sequencing (NGS) technology produces large numbers of short sequence reads. Sequence alignment is the process of mapping these short reads to a reference genome, and it is of great research significance for biological homology analysis, SNP locus prediction, and disease prediction. Because NGS data contains many repeated sequences, and processing them consumes resources unnecessarily, sequence deduplication is a common preprocessing step. Many deduplication and alignment methods exist, but they handle large-scale sequencing data poorly, in particular because of their long running times. Parallel processing algorithms based on big-data platforms have recently been proposed; although they improve efficiency considerably, there is still room for further performance gains. To address this, this thesis studies and implements large-scale sequence deduplication and sequence alignment algorithms based on Hadoop. The main contents and conclusions are as follows:

(1) Research and improvement of the sequence deduplication algorithm. To deal with the large number of repeated sequences in sequencing data, this thesis studies the parallel deduplication algorithm based on the prefix-suffix idea and improves it in two respects. First, the output of the original algorithm still contains benchmark repeat sequences; the improved algorithm removes them while processing duplicated data, which raises the deduplication rate. Second, the output of the original algorithm includes many low-quality sequences; this thesis therefore combines quality control with deduplication to filter out low-quality sequences and improve the quality of the sequencing data.

(2) Parallelization of the sequence alignment algorithm. To address the low efficiency of sequence alignment on large-scale data, this thesis designs and implements BigBowtie, a Hadoop-based parallel sequence alignment algorithm that calls a dynamic library through JNI. The algorithm has two separate software layers, which avoids changing the original code and keeps it compatible with different versions of Bowtie2. The parallel alignment is divided into four stages: data format conversion, data distribution, sequence alignment, and result summarization. Together these stages parallelize Bowtie2 and shorten its execution time.

The experimental results show that the improved algorithm raises the deduplication rate by up to 1.74%, and that the highest proportion of non-repeated sequences reaches 99.75%. All base quality score indicators improve, which provides a reliable quality guarantee for downstream data processing and analysis. Compared with Bowtie2, BigBowtie achieves a maximum speedup of 7.79 and reduces the running time by 22261 s. BigBowtie also runs in less time than BigBWA, an existing Hadoop-based parallel algorithm.
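
The idea of combining quality control with deduplication, as described in item (1), can be illustrated with a small single-machine sketch. This is not the thesis's implementation: the abstract does not give the algorithm's details, so the sketch assumes exact-duplicate removal, a Phred+33 quality encoding, a mean-quality filter, and a fixed-length prefix used as a stand-in for the key that would route reads to reducers. The names `PREFIX_LEN`, `MIN_MEAN_QUAL`, `mean_quality`, and `dedup_with_qc` are hypothetical.

```python
from collections import defaultdict

PREFIX_LEN = 4       # hypothetical partition-key length (assumption)
MIN_MEAN_QUAL = 20   # hypothetical mean Phred threshold (assumption)


def mean_quality(qual):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in qual) / len(qual)


def dedup_with_qc(reads, prefix_len=PREFIX_LEN, min_qual=MIN_MEAN_QUAL):
    """Filter low-quality reads, then drop exact duplicate sequences.

    reads: iterable of (read_id, sequence, quality_string) tuples.
    Returns the surviving reads as a list of the same tuples.
    """
    # "Map" step: discard low-quality reads and key the rest by prefix,
    # mimicking how a shuffle would group reads on one reducer.
    buckets = defaultdict(list)
    for read_id, seq, qual in reads:
        if not seq or len(seq) != len(qual):
            continue  # skip malformed records
        if mean_quality(qual) < min_qual:
            continue  # quality control happens in the same pass
        buckets[seq[:prefix_len]].append((read_id, seq, qual))

    # "Reduce" step: reads in a bucket share a prefix, so only the
    # suffixes need comparing. Keep the best-quality copy of each.
    kept = []
    for prefix in sorted(buckets):
        best = {}
        for read_id, seq, qual in buckets[prefix]:
            suffix = seq[prefix_len:]
            current = best.get(suffix)
            if current is None or mean_quality(qual) > mean_quality(current[2]):
                best[suffix] = (read_id, seq, qual)
        kept.extend(best.values())
    return kept


if __name__ == "__main__":
    sample = [
        ("r1", "ACGTACGT", "IIIIIIII"),  # high quality
        ("r2", "ACGTACGT", "55555555"),  # duplicate of r1, lower quality
        ("r3", "ACGTTTTT", "IIIIIIII"),  # same prefix, different suffix
        ("r4", "GGGGCCCC", "!!!!!!!!"),  # fails the quality filter
    ]
    for read in dedup_with_qc(sample):
        print(read[0], read[1])
```

Filtering in the same pass as the grouping means low-quality reads never reach the duplicate comparison. That is one plausible reading of why the two steps are combined, though the thesis may organize them differently.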

Keywords/Search Tags:

NGS, Hadoop platform, Sequence deduplication, Sequence alignment

Related items

1	The BWT Index Building Method For A Gene Sequence Alignment Research On Hadoop
2	The Research And Implementation Of Biological Sequence Alignment
3	Research Of Improvement And Parallelization For Sequence Assembly And Multiple Sequence Alignment
4	Biological Sequence Alignment Problem
5	The Application Of ACO And Coding Method In Sequence Analysis
6	Research On Sequence Alignment Algorithms In Bioinformatics
7	Automatically Get To Build The Study Of Biological Information Platform And Sequence Alignment Algorithm Based On Information
8	Study On Biology DNA Sequence Alignment Algorithm
9	Research Of Go Functional Annotation Platform With Homology Search Based On Hadoop
10	Research On Multiple Sequence Alignment Algorithms In Bioinformatics