Algorithms On Biological Sequence Alignments

Posted on:2009-10-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu


GTID:2120360272961472

Subject:Health Statistics

Abstract/Summary:

Motivation: Bioinformatics has developed rapidly in recent years. The field mainly studies the sequences, structures, and functions of biological macromolecules, chiefly DNA and proteins. Proteins play an important role in biological processes, and the analysis of protein function is a central problem in biology. The function of a protein is determined not only by its primary structure but also by its specific spatial structure. With current techniques, protein sequences (primary structures) can be obtained through gene sequencing, whereas determining the three-dimensional structure of a protein can require a large amount of time and cost. It is therefore important, in both theory and practice, to predict protein functions and to classify proteins from their sequences. With the progress of the genome project, DNA and protein sequence data have grown rapidly, and finding homologous proteins in biological databases has become an effective method for predicting protein functions and classifying proteins. This thesis studies that problem.

Genome rearrangement is another important problem in computational biology and an important model of the evolution of microorganisms, plants, and animals. The process of genome rearrangement is complicated, but it is built from a few basic operations, mainly reversal, translocation, and transposition. Evolution between biological species is in effect a process of genome mutation. The problem of computing the rearrangement distance between different gene sequences is called the genome sorting problem. This thesis studies the signed genome translocation distance problem.

Method: Sequence alignment and motif identification are two important methods in biological sequence analysis. Proteins with similar amino acid sequences often have similar functions. For a new protein sequence, we can find subsequences that are similar to other protein sequences, then predict its functions and classify it based on the functions of those similar proteins. Because DNA and protein sequence data sets are large, fast and effective algorithms are key to finding useful information in them. This thesis focuses mainly on the multiple sequence alignment problem and the motif identification problem. We propose a new algorithm, PSEM, for finding a local alignment of a group of protein sequences and the corresponding motifs. PSEM uses two techniques: random motif seed selection and EM (expectation-maximization) refinement. For the signed genome translocation distance problem, we analyze the properties of the breakpoint graph and find a way to improve the original algorithm. The key to the improvement is splitting long cycles and using algorithms for finding and merging sets.

Result: We selected 100 groups of proteins from the Pfam database, each containing protein sequences from one protein family. Experimental results show that PSEM finds high-quality motifs for each group of protein sequences. We also tested the classification of new protein sequences based on the discovered motifs, and the results show that the method classifies protein sequences with high accuracy. For the signed genome translocation distance problem, we give an O(n log* n) algorithm, which improves on the previous O(n^2) algorithm.

Conclusion: This thesis mainly studies how to use protein sequences to find homologous proteins, and then how to use those homologous proteins to predict protein functions and classify proteins. We propose a new algorithm, PSEM, for finding motifs in a group of protein sequences, and use the discovered motifs to classify proteins. Experimental results show that PSEM finds high-quality motifs and that the discovered motifs classify protein sequences with high accuracy. PSEM is therefore an effective method for protein sequence analysis. The fast algorithm for the signed genome translocation distance problem also provides a method for the genome rearrangement distance problem.
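The "algorithms for finding and merging sets" named in the method description read like a disjoint-set (union-find) structure, whose classic analysis gives the log* factor that appears in the stated O(n log* n) bound. The abstract does not say how the structure is applied to the breakpoint graph, so the following is only a generic sketch of union-find with path compression and union by rank, not the thesis's algorithm. The class name `DisjointSet` and the helper `count_components` are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Tuple


class DisjointSet:
    """Union-find over the integers 0..n-1 (illustrative sketch only)."""

    def __init__(self, n: int) -> None:
        self.parent: List[int] = list(range(n))  # each element starts as its own root
        self.rank: List[int] = [0] * n           # upper bound on the height of each tree
        self.count: int = n                      # number of disjoint sets currently held

    def find(self, x: int) -> int:
        """Return the representative of the set containing x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: int, b: int) -> bool:
        """Merge the sets containing a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        self.count -= 1
        return True


def count_components(n: int, edges: Iterable[Tuple[int, int]]) -> int:
    """Hypothetical helper: count connected components of an n-vertex graph."""
    ds = DisjointSet(n)
    for u, v in edges:
        ds.union(u, v)
    return ds.count
```

With both heuristics, a sequence of m find and merge operations on n elements runs in O(m log* n) time (the tighter known bound uses the inverse Ackermann function), which is how a near-linear total can replace a quadratic one when set membership is queried repeatedly.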

Keywords/Search Tags:

bioinformatics, sequence alignment, genome sorting, algorithm, protein classification and motifs

Related items

1	Research On Multiple Sequence Alignment Method Based On Single Molecule Sequencing Data
2	Study Of Protein Structure Classification
3	The Research Of Sequence Alignment In Bioinformatics
4	Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data
5	Biological Sequence Alignment Algorithm And A Comparative Study
6	A Feature Extraction Algorithm For G Protein-coupled Receptor Classification
7	Bioinformatics Platform To Build And Sequence Alignment Algorithm Study
8	The Research Of Sequence Alignment In Bioinformatics
9	Study On Optimization Of Gene Sequence Alignment Algorithm
10	The Algorithms Of Sequence Alignment In Bioinformatics