Error Correction Of NGS Gene Based On Multiple Sequence Alignment

Posted on: 2017-05-29
Degree: Master
Type: Thesis
Country: China
Candidate: H H Wang
Full Text: PDF
GTID: 2310330512970626
Subject: Computer technology
Abstract/Summary:
NGS (next generation sequencing) technology generates a large number of short gene fragments, which usually contain many errors, so efficient error correction methods are required. Multiple sequence alignment can serve as such a method: by aligning more than two sequences and comparing the differences between character columns, it finds the common structural features used in gene expression data analysis. Most current multiple sequence alignment algorithms rely on multithreading and cannot meet the demand for fast, low-cost processing of massive genetic data. Distributed data processing on cloud computing platforms provides a good tool for gene error correction: the massive data is allocated to different computation nodes and processed in a distributed manner. This paper adds data preprocessing steps that reduce some errors in the gene file, such as base deletions and interruptions by other characters. As a result, the error correction program no longer has to normalize the data itself, and its workload is reduced. This paper proposes a gene error correction algorithm, MSAC (multiple sequence alignment correction), based on multiple sequence alignment, in which sequence alignment databases generated from a variety of k-mers are applied to improve error correction performance. In addition, the MSAC algorithm is implemented in Scala and ported to the Spark platform for distributed processing of genetic data, which speeds up the error correction process. Experimental results on the real data sets Staphylococcus aureus (436M), Rhodobacter sphaeroides (242M), Human Chromosome 14 (9.6G), and Bombus impatiens (92G) show that MSAC performs better than the Coral and ECHO algorithms on the cloud computing platform. MSAC on the Spark platform shows better distributed data processing capability: in particular, the average running time of the program is reduced by nearly 30%, and memory consumption is about 1/3 of that of Coral and ECHO.
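
The abstract does not describe MSAC's internals, so the following is only a minimal Python sketch of the general idea behind alignment-based correction, not the thesis's algorithm (which is written in Scala and runs on Spark). It assumes the reads have already been aligned to equal length with "-" marking gaps. The function names `kmers`, `column_consensus`, and `correct_reads`, and the `min_support` threshold, are hypothetical choices made for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def column_consensus(aligned_reads, min_support=0.6):
    """Pick the majority base in each alignment column.

    Assumes all reads have the same length and use '-' for gaps.
    A column gets None when it holds only gaps or when the most
    common base falls below the (assumed) min_support fraction.
    """
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            consensus.append(None)
            continue
        base, n = counts.most_common(1)[0]
        total = sum(counts.values())
        consensus.append(base if n / total >= min_support else None)
    return consensus


def correct_reads(aligned_reads, min_support=0.6):
    """Replace each base with the column consensus where one exists.

    Gaps and columns without a confident consensus are left unchanged.
    """
    consensus = column_consensus(aligned_reads, min_support)
    return [
        "".join(
            base if (base == "-" or cons is None) else cons
            for base, cons in zip(read, consensus)
        )
        for read in aligned_reads
    ]


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]
    print(kmers(reads[0], 3))
    print(correct_reads(reads))
```

In this sketch the third read's mismatching base is outvoted by the other three reads in its column. In a distributed setting, groups of reads sharing k-mers could be corrected independently on different nodes, which is the kind of partitioning a platform such as Spark supports.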
Keywords/Search Tags: NGS, MSA, error correction, Spark