Font Size: a A A

LSS-DCA: A Fast Multiple Sequence Alignment With Divide-and-conquer Algorithm

Posted on:2004-08-23Degree:MasterType:Thesis
Country:ChinaCandidate:H H GongFull Text:PDF
GTID:2168360095951145Subject:Biochemistry and Molecular Biology
Abstract/Summary:PDF Full Text Request
With the arrival of the post-genome era and recent development of new, high-throughput technologies to mine data in Biology, vast amounts of sequence data are flooding the DNA and protein databases so rapidly that there is a strong need for efficient as well as effective computational tools to handle these data. Among these tools, Multiple Sequence Alignment (MSA) is the most important because MSA is central to most Bioinformatics techniques. Due to the NP-completeness of optimal MSA, any attempt of developing a fast algorithm to compute optimal multiple sequence alignments is expected to fail. So a good MSA program tries to get a best balance between alignment quality and computation time. Divide-and-conquer alignment (DCA) method is one of such programs.The ideal of DCA that cut long sequences set into some shorter sequences set is expected to produce fast multiple alignment programs. Because the computational complexity of MSA grows exponentially with the length of the sequences, and division can extremely reduce the computational complex and lead to fast alignment speed eventually. Firstly, DCA cut sequences in original family at a position near to their midpoint, and two new subfamilies of shorter sequences are obtained. Repeat this way until sequences in obtained new subfamilies are enough short to be aligned easily. After aligning these new families optimally, the original family is also aligned by concatenating these resulting alignments. The key question arising in DCA is how to compute slicing points in every sequence. In fact, the DCA program computes the slicing point with a greedy method and thus the speed of DCA is not fast as expected.In this paper, we present a new alignment method (LSS-DCA) that combines the advantages of the DCA method and progressive method, and that balances the speed and the quality better. LSS-DCA employs a simplified but very strict progressive method to localize the obvious longest similarity segments (LSS) in the sequences and then cut the sequences at the two terminals of LSS. The prefixes and suffixes of LSS are recursively processed with the same way until all LSSes are detected and aligned. A final MSA is obtained by filling the intervals between localized LSSes with the counterpart sequences segments in original sequences or gaps.The progressive method to compute LSSes is so strictly that it is very possible to find no LSS in sequences, which leads to fail of DCA procedure. In order to avoid such failure, another vertical DCA (As opposed to horizontal DCA along the sequences) procedure is introduced to close the relative ofsequences by clustering the original sequences into two closer sequence sets. The clustering is iterative until all sequences are classified into different groups that can be found LSS easly. The two DCA (vertical and horizontal) procedures are cross, but the horizontal DCA has higher priority. Only if the sequences are not fit for the horizontal DCA procedure, the vertical DCA is executed. An extremity is that all sequences are long distance and every sequence is classified different groups. In this case, LSS-DCA aligned the sequences with the same way as ClustalW.An implementation of LSS-DCA is developed and the time and memory requirement of the program are assessed. Several examples are shown and the alignment results produced by LSS-DCA are compared systematically to the results of six other alignment programs. The results show that LSS-DCA can produce good alignment with a fast speed. But the results also indicate that there is a kind of sequence sets in which one or more than one of the sequences share two or more than two similar conserved region and that the alignment of those sets is not as good as other sets produced by LSS-DCA.
Keywords/Search Tags:Bioinformatics, Multiple sequence alignment (MSA), Divide-and-conquer alignment (DCA), Longest similarity segment (LSS)
PDF Full Text Request
Related items