Research On NGS Data Processing Algorithm Based On Hadoop Platform

Posted on: 2020-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: Y R Fang
Full Text: PDF
GTID: 2428330599450901
Subject: Engineering
Abstract/Summary:
The development of next-generation sequencing (NGS) technology has produced large volumes of short sequencing reads. Sequence alignment, the process of mapping these short reads to a reference genome, is important for biological homology analysis, SNP locus prediction, and disease prediction. Because NGS data contains many repeated sequences, and processing them consumes resources unnecessarily, sequence deduplication is a common preprocessing step. Many deduplication and alignment methods exist, but they handle large-scale sequencing data poorly, chiefly because of long running times. Parallel algorithms built on big data platforms have recently been proposed; they improve efficiency considerably but leave room for further performance gains. To address this, this thesis studies and implements large-scale sequence deduplication and sequence alignment algorithms based on Hadoop. The main contents and conclusions are as follows:

(1) Research and improvement of the sequence deduplication algorithm

To handle the large number of repeated sequences in sequencing data, this thesis studies a parallel deduplication algorithm based on the prefix-suffix idea and improves it in two respects. First, the output of the original algorithm still contains benchmark repeat sequences; the improved algorithm removes these while processing the duplicated data, which raises the deduplication rate. Second, the output of the original algorithm includes many low-quality sequences; the improved algorithm combines quality control with deduplication to filter them out and improve the quality of the sequencing data.

(2) Parallelization of the sequence alignment algorithm

To address the low efficiency of sequence alignment on large-scale data, this thesis designs and implements BigBowtie, a parallel sequence alignment algorithm based on Hadoop that calls a dynamic library through JNI. The algorithm has two separate software layers, which avoids changes to the original Bowtie2 code and preserves compatibility across Bowtie2 versions. The parallel alignment is divided into four stages: data format conversion, data distribution, sequence alignment, and result summarization. Together these stages parallelize Bowtie2 and shorten its execution time.

The experimental results show that the improved algorithm raises deduplication by up to 1.74%, and that the proportion of non-repeated sequences reaches up to 99.75%. All base quality score indicators improve, which provides a reliable quality guarantee for downstream data processing and analysis. Compared with Bowtie2, BigBowtie achieves a maximum speedup of 7.79 and reduces running time by as much as 22261 s. BigBowtie also runs in less time than BigBWA, an existing Hadoop-based parallel algorithm.
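The idea in (1) of combining quality control with deduplication can be illustrated with a small single-process sketch in the map/reduce style. This is not the thesis's algorithm: the abstract does not specify the prefix-suffix scheme, so the sketch assumes a simpler variant in which the map step filters reads by mean Phred quality and buckets the survivors by sequence prefix, and the reduce step keeps one read per distinct sequence in each bucket. The names `Read`, `map_phase`, `reduce_phase`, and `deduplicate`, the Phred+33 encoding, and the parameters `prefix_len` and `min_mean_q` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Read(NamedTuple):
    name: str
    seq: str
    qual: str  # quality string, assumed Phred+33, same length as seq


def mean_quality(read: Read) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(c) - 33 for c in read.qual) / len(read.qual)


def map_phase(reads: Iterable[Read], prefix_len: int = 4,
              min_mean_q: float = 20.0) -> Dict[str, List[Read]]:
    """Drop low-quality reads, then bucket the rest by sequence prefix.

    On a cluster, the prefix would serve as the shuffle key, so that
    identical sequences always reach the same reducer.
    """
    buckets: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        # Quality control: skip empty or low-quality reads (assumed threshold).
        if not read.qual or mean_quality(read) < min_mean_q:
            continue
        buckets[read.seq[:prefix_len]].append(read)
    return buckets


def reduce_phase(bucket: Iterable[Read]) -> List[Read]:
    """Keep one read per distinct sequence: the one with the best mean quality."""
    best: Dict[str, Read] = {}
    for read in bucket:
        kept = best.get(read.seq)
        if kept is None or mean_quality(read) > mean_quality(kept):
            best[read.seq] = read
    return list(best.values())


def deduplicate(reads: Iterable[Read], prefix_len: int = 4,
                min_mean_q: float = 20.0) -> List[Read]:
    """Run the map and reduce steps in one process, bucket by bucket."""
    buckets = map_phase(reads, prefix_len, min_mean_q)
    result: List[Read] = []
    for key in sorted(buckets):  # sorted only to make the output order stable
        result.extend(reduce_phase(buckets[key]))
    return result
```

Calling `deduplicate` on a list of `Read` records returns the surviving reads. Doing the filtering in the map step means low-quality reads are never shuffled to a reducer, which is one plausible reason to combine the two operations rather than run them as separate passes.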
Keywords/Search Tags: NGS, Hadoop platform, Sequence deduplication, Sequence alignment