
Research On InDel Detection Approach Based On Short Sequence Alignment

Posted on: 2016-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: X D Wang
Full Text: PDF
GTID: 2310330509457045
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and spread of next-generation sequencing, also known as high-throughput sequencing, the cost of sequencing has steadily fallen while sequencing throughput has steadily grown, which has greatly advanced bioinformatics research. Detecting and analyzing InDels (small insertions and deletions) through sequence alignment can help locate disease-related gene positions, explore disease pathology, and determine treatment plans. However, the massive data volume and the high accuracy requirement make InDel detection very challenging. This thesis therefore studies the difficulties of InDel detection based on DNA short-sequence alignment.

Mapping short reads directly to the reference raises two problems in the alignment procedure. The first is the high computational cost of the mapping step. The second is that a read that can be mapped is placed at the first matching position, which may not be the optimal one. To address both problems, this thesis first builds a hash table over the set of seeds extracted from the reference sequence with a sliding window; the hash table helps find the correct match position during alignment. Notably, building the hash table consumes a lot of memory. This thesis therefore compresses every base of the reference sequence with a binary code while building the hash table, which reduces memory usage by three quarters.

The InDel detection procedure also has two problems. The first is that reads produced by high-throughput sequencing are very short, and a seed, being a sub-sequence of a read, is shorter still; a seed can therefore usually map to multiple positions on the reference sequence, which leads to inaccurate mapping results. The second is that, because InDels are distributed randomly within a read, a seed may cover an InDel and consequently be matched to the wrong position on the reference sequence. To improve the correctness of InDel detection, this thesis proposes the following: for each read, several sub-sequences are extracted with a sliding window and mapped to the reference sequence to obtain a set of candidate locations for each; to reduce false-positive InDels, a measure called supportNum is applied; and, based on supportNum, a parameter called threshold is added to the evaluation procedure.

For the alignment method, this thesis chooses Needleman-Wunsch, a representative global alignment algorithm: because the InDels are very short, just 1-2 bp, global alignment is well suited to aligning the reads. To further improve the performance of the algorithm, we propose re-evaluating the intersection of good InDel detection results. Finally, experiments on our data sets show that the InDel detection results of the proposed method are favorable.
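The seed hash table and the binary compression of bases described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the 2-bit mapping in `BASE_CODE`, the function names `encode_kmer`, `build_seed_index`, and `lookup`, and the choice to skip seeds containing non-ACGT characters are all assumptions made for illustration. Two bits per base in place of an eight-bit character is one encoding that would be consistent with the reported three-quarters memory reduction.

```python
from collections import defaultdict

# Assumed 2-bit code per base; the abstract does not give the exact mapping.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into one integer, two bits per base.

    Returns None if the k-mer contains a character other than A, C, G, T.
    """
    value = 0
    for base in kmer:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def build_seed_index(reference, k):
    """Slide a window of length k over the reference and record, for each
    encoded seed, every start position at which it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        key = encode_kmer(reference[pos:pos + k])
        if key is not None:
            index[key].append(pos)
    return index


def lookup(index, seed):
    """Return all reference positions of a seed (empty list if absent)."""
    key = encode_kmer(seed)
    if key is None:
        return []
    return index.get(key, [])
```

Because every occurrence of a seed is stored, a read is no longer tied to the first matching position; all candidate positions can be evaluated later. In Python the integer keys do not by themselves deliver the memory saving, so the sketch shows only the encoding idea.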
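The abstract names supportNum and threshold but does not define them, so the following sketch rests on an assumed reading: each seed taken from the read at offset `offset` and found in the reference at position `pos` votes for the read starting at `pos - offset`, supportNum is the number of seeds that agree on a start, and a candidate is kept only when its supportNum reaches `threshold`. The function names and the `step` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Plain k-mer index: seed string -> list of start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_starts(read, index, k, step, threshold):
    """Return {implied read start: supportNum} for every start position
    supported by at least `threshold` seeds (assumed interpretation)."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, step):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            # A seed at read offset `offset` hitting reference position
            # `pos` implies the read begins at `pos - offset`.
            votes[pos - offset] += 1
    return {start: n for start, n in votes.items() if n >= threshold}
```

Under this reading, a seed that lands on a repeated region or covers an InDel produces an isolated vote that falls below the threshold, while seeds on either side of a short InDel vote for starts that differ only by the InDel length.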
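As an illustration of the global alignment step, here is a minimal Needleman-Wunsch sketch with a linear gap penalty, followed by a helper that reads InDels off the aligned strings. The scoring values (match 1, mismatch -1, gap -2), the function names, and the tuple format of the reported InDels are assumptions; the abstract does not state the scoring scheme the thesis uses.

```python
def needleman_wunsch(ref, read, match=1, mismatch=-1, gap=-2):
    """Globally align `read` to `ref`; return (score, aligned_ref, aligned_read)."""
    n, m = len(ref), len(read)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if ref[i - 1] == read[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    a_ref, a_read = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a_ref.append(ref[i - 1]); a_read.append(read[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_ref.append(ref[i - 1]); a_read.append("-")  # base missing from read
            i -= 1
        else:
            a_ref.append("-"); a_read.append(read[j - 1])  # extra base in read
            j -= 1
    return score[n][m], "".join(reversed(a_ref)), "".join(reversed(a_read))


def call_indels(aligned_ref, aligned_read):
    """List (kind, reference position, bases) for each gap run in the alignment."""
    indels, ref_pos, col = [], 0, 0
    while col < len(aligned_ref):
        if aligned_ref[col] == "-":
            end = col
            while end < len(aligned_ref) and aligned_ref[end] == "-":
                end += 1
            indels.append(("insertion", ref_pos, aligned_read[col:end]))
            col = end
        elif aligned_read[col] == "-":
            end = col
            while end < len(aligned_read) and aligned_read[end] == "-":
                end += 1
            indels.append(("deletion", ref_pos, aligned_ref[col:end]))
            ref_pos += end - col
            col = end
        else:
            ref_pos += 1
            col += 1
    return indels
```

In practice the read would be aligned against the reference window around a candidate location rather than the whole reference, which keeps the quadratic dynamic-programming table small.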
Keywords/Search Tags: high-throughput sequencing technology, DNA short reads alignment, InDel detection