
Algorithms on Biological Sequence Alignments

Posted on: 2009-10-23; Degree: Master; Type: Thesis
Country: China; Candidate: Y Liu; Full Text: PDF
GTID: 2120360272961472; Subject: Health Statistics
Abstract/Summary:
Motivation: Bioinformatics has developed rapidly in recent years. It studies the sequences, structures, and functions of biological macromolecules, chiefly DNA and proteins. Proteins play an important role in biological processes, and the analysis of protein function is a central problem in biology. The function of a protein is determined not only by its primary structure but also by its specific spatial structure. With current techniques, protein sequences (primary structures) can be obtained through gene sequencing, whereas determining the three-dimensional structure of a protein can require a great deal of time and cost. It is therefore important, in both theory and practice, to predict protein functions and to classify proteins from their sequences. As genome projects have progressed, the volume of DNA and protein sequence data has grown rapidly, and searching biological databases for homologous proteins has become an effective way to predict protein functions and classify proteins. This thesis studies that problem.

Genome rearrangement is another important problem in computational biology and an important model of the evolution of microorganisms, plants, and animals. The rearrangement process is complicated, but it is built from a few basic operations: reversal, translocation, and transposition. Evolution between biological species can be viewed as a process of genome mutation. The problem of computing the rearrangement distance between different gene sequences is called the genome sorting problem. This thesis studies the signed genome translocation distance problem.

Method: Sequence alignment and motif identification are two important methods in biological sequence analysis. Proteins with similar amino acid sequences often have similar functions. For a new protein sequence, we can find its subsequences that are similar to other protein sequences, then predict its functions and classify it based on the functions of those similar proteins. Because the volume of DNA and protein sequence data is large, fast and effective algorithms are key to extracting useful information from it. This thesis focuses on the multiple sequence alignment problem and the motif identification problem. We propose a new algorithm, PSEM, for finding a local alignment of a group of protein sequences together with the corresponding motifs. PSEM uses two techniques: random motif seed selection and EM refinement. For the signed genome translocation distance problem, we analyze the properties of the breakpoint graph and find a way to improve the original algorithm. The key to the improvement is splitting long cycles and using algorithms for finding and merging sets (union-find).

Result: We selected 100 groups of proteins from the Pfam database, each group containing protein sequences from one protein family. Experimental results show that the PSEM algorithm finds high-quality motifs for each group of protein sequences. We also tested the classification of new protein sequences based on the discovered motifs, and the results show that our method classifies protein sequences with high accuracy. For the signed genome translocation distance problem, we give an O(n log* n) algorithm, which improves on the previous O(n^2) algorithm.

Conclusion: This thesis studies how to use protein sequences to find homologous proteins, and then how to use homologous proteins to predict protein functions and classify proteins. We propose a new algorithm, PSEM, for finding motifs in a group of protein sequences, and use the discovered motifs to classify proteins. Experimental results show that PSEM finds high-quality motifs and that the discovered motifs classify protein sequences with high accuracy. PSEM is therefore an effective method for protein sequence analysis. The fast algorithm for the signed genome translocation distance problem also provides a method for the genome rearrangement distance problem.
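
The abstract names the two ingredients of PSEM (random motif seed selection and EM refinement) but gives no details. The following Python sketch illustrates the general idea only; it is not the thesis's implementation. It assumes exactly one motif occurrence per sequence, a uniform background, sequences over the 20 standard amino acids that are at least as long as the motif, and a fixed motif width. The function names (seed_pwm, em_refine, find_motif, best_site) and all parameters are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def seed_pwm(seed, pseudocount=0.5):
    """Turn one seed substring into a position weight matrix (PWM)."""
    total = 1.0 + pseudocount * len(ALPHABET)
    return [
        {a: ((1.0 if a == ch else 0.0) + pseudocount) / total for a in ALPHABET}
        for ch in seed
    ]


def window_likelihood(pwm, window):
    """Probability of a window under the PWM (uniform background cancels out)."""
    p = 1.0
    for column, ch in zip(pwm, window):
        p *= column[ch]
    return p


def em_refine(seqs, pwm, iterations=20, pseudocount=0.5):
    """Refine a PWM by EM, assuming one motif occurrence per sequence."""
    w = len(pwm)
    for _ in range(iterations):
        counts = [{a: pseudocount for a in ALPHABET} for _ in range(w)]
        for s in seqs:
            # E-step: posterior probability of each start position in s.
            scores = [window_likelihood(pwm, s[i:i + w])
                      for i in range(len(s) - w + 1)]
            z = sum(scores)
            for i, score in enumerate(scores):
                resp = score / z
                for j in range(w):
                    counts[j][s[i + j]] += resp
        # M-step: re-estimate each column from the expected counts.
        pwm = []
        for column in counts:
            col_total = sum(column.values())
            pwm.append({a: column[a] / col_total for a in ALPHABET})
    return pwm


def log_likelihood(seqs, pwm):
    """Log-likelihood of the sequences, summed over all start positions."""
    w = len(pwm)
    return sum(
        math.log(sum(window_likelihood(pwm, s[i:i + w])
                     for i in range(len(s) - w + 1)))
        for s in seqs
    )


def find_motif(seqs, width, n_seeds=10, rng=None):
    """Try several random seeds, refine each with EM, keep the best PWM."""
    rng = rng or random.Random(0)
    best_ll, best_pwm = None, None
    for _ in range(n_seeds):
        s = rng.choice(seqs)
        start = rng.randrange(len(s) - width + 1)
        pwm = em_refine(seqs, seed_pwm(s[start:start + width]))
        ll = log_likelihood(seqs, pwm)
        if best_ll is None or ll > best_ll:
            best_ll, best_pwm = ll, pwm
    return best_pwm


def best_site(pwm, seq):
    """Start index of the window in seq that scores highest under the PWM."""
    w = len(pwm)
    return max(range(len(seq) - w + 1),
               key=lambda i: window_likelihood(pwm, seq[i:i + w]))
```

In this sketch, the best-scoring window in each sequence, taken together, gives a gapless local alignment of the group, and a new sequence could be scored against a family's PWM as a rough classifier. How PSEM itself selects seeds, handles gaps or multiple occurrences, and performs classification is not stated in the abstract.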
Keywords/Search Tags: bioinformatics, sequence alignment, genome sorting, algorithm, protein classification and motifs