Research And Implementation Of Contig Assembly Algorithm On Next Generation DNA Sequencing Data

Posted on: 2017-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Ma
Full Text: PDF
GTID: 2308330509957493
Subject: Computer technology
Abstract/Summary:
Since the start of the 21st century, the high throughput and low cost of next-generation sequencing technology have greatly promoted the development of bioinformatics. To extract the information contained in sequencing data, genome assembly algorithms have become a core research topic in this field, and many excellent assembly algorithms have emerged. In recent years, continuous innovation in sequencing technology has changed the sequencing data themselves: besides high throughput and a high error rate, the data now include paired-end reads and longer read lengths. The original genome assembly algorithms cannot exploit these new characteristics, so designing a genome assembly algorithm that makes full use of them is an urgent problem.

Whole-genome assembly is divided into two stages: assembling reads into contigs, called contig generation, and assembling contigs into scaffolds, called contig assembly. The whole-genome de novo assembly algorithm studied in this thesis targets the second stage: starting from an existing contig set, it uses paired-end data to assemble contigs into scaffolds. Contig assembly recovers the information in the sequencing data and restores the gene sequence of the target organism, which is of considerable research significance.

This thesis proposes a new contig assembly algorithm designed around the characteristics of the new data. The algorithm first uses the reads with the shorter insert size to find associations between paired-end data and contigs. Based on these associations, a correlation evaluation method assigns a score to every pair of contigs and thereby determines their relative positions. The algorithm then handles the positional relationships that arise during assembly in order to optimize the result. Finally, it uses the reads with the longer insert size for further processing and outputs the scaffold sequences. In this way the algorithm makes full use of the characteristics of next-generation data when assembling contigs into scaffolds.

At the end of the thesis, the results of the proposed algorithm are compared with those of SOAPdenovo2 and Velvet. The scaffold sequences assembled by the proposed algorithm show higher accuracy and better overall performance, and are therefore more credible, which lays a good foundation for follow-up analysis of the genome.
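The abstract does not specify the correlation evaluation method, so the following is only a minimal sketch of the general idea of scoring contig pairs from paired-end data, not the thesis's algorithm. It assumes each read pair has already been mapped to contigs, uses a plain link count as the score, and joins contigs greedily; the input format, the function names `score_contig_pairs` and `greedy_scaffolds`, and the `min_links` threshold are all hypothetical. Orientation, insert-size distances, and the second pass with longer-insert reads are left out.

```python
from collections import Counter


def score_contig_pairs(pair_hits):
    """Count paired-end links between distinct contigs.

    pair_hits: iterable of (contig_of_mate1, contig_of_mate2) tuples,
    one per mapped read pair (a hypothetical input format).
    A simple link count stands in for the thesis's correlation score.
    """
    scores = Counter()
    for c1, c2 in pair_hits:
        if c1 != c2:  # both mates in one contig: no linking information
            scores[tuple(sorted((c1, c2)))] += 1
    return scores


def greedy_scaffolds(scores, min_links=2):
    """Pick contig adjacencies, best-supported links first.

    Each contig gets at most two neighbours (one per end) and no cycles
    are allowed, so the accepted links form linear chains.
    """
    degree = Counter()
    parent = {}

    def find(x):
        # Union-find lookup with path halving, used to detect cycles.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), links in ranked:
        if links < min_links:
            break  # remaining pairs have even weaker support
        if degree[a] >= 2 or degree[b] >= 2:
            continue  # both ends of this contig are already joined
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # would close a cycle
        parent[root_a] = root_b
        degree[a] += 1
        degree[b] += 1
        edges.append((a, b))
    return edges
```

Calling `greedy_scaffolds(score_contig_pairs(hits))` on a list of mapped pairs returns the accepted contig adjacencies; a fuller scaffolder would also estimate gap sizes and orientations from the insert sizes.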
Keywords/Search Tags: de novo, contig assembly, correlation evaluation, paired-end data