Research On Protein Multiple Sequence Alignment Method Based On Hidden Markov Model

Posted on: 2020-04-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Zhan
GTID: 1360330614950616
Subject: Computer application technology
Abstract/Summary:
Similarity among a set of biological sequences implies functional similarity or suggests divergence from a common ancestor. A common way to determine how similar the sequences are is to align them, i.e., to arrange homologous positions across the sequences in columns. Multiple sequence alignment (MSA) also helps biologists isolate relevant regions in the sequences, and identifying these regions is important for analyses such as protein secondary structure prediction, phylogenetic tree reconstruction, and functional inference through protein domain profile comparison. Because the MSA problem is NP-hard under common scoring schemes, computing an exact optimal solution with standard dynamic programming is impractical for more than a few sequences. Progressive alignment is the most widely used class of approximation algorithms for this problem. Such algorithms first compute a distance matrix from pairwise alignments of the sequences and then build a guide tree from the distance matrix. Finally, the alignment is constructed progressively in the order given by the guide tree, and the result is refined iteratively to obtain a better alignment. In general, these algorithms transform the multiple alignment into a succession of pairwise sequence or profile alignments and thereby find an approximate solution to the problem.

To address the shortcomings of existing MSA methods, this dissertation studies several aspects of the progressive approach: the substitution score of residues, the construction of the guide tree, and the realignment refinement of alignment results. The main research contents are as follows.

(1) Fixed substitution scoring methods cannot accurately reflect the position specificity and sequence consistency of residue pairs, so alignments of protein families with low homology are less accurate. This dissertation proposes a residue substitution scoring method based on the optimization and combination of hidden Markov models. The posterior probability of residue pairs is used as the substitution score in the dynamic programming of pairwise alignment, a step that is critical to the entire alignment process. Some previous studies used optimization algorithms such as particle swarm optimization and genetic algorithms to optimize hidden Markov models for MSA. Others combined hidden Markov models with other probabilistic models, such as the partition function, to calculate the posterior probability. However, no existing work combines a hidden Markov model tuned by an optimization algorithm with the partition function to calculate the posterior probability. The proposed method combines these two models and is compared with many similar methods. The experimental results show that using the posterior probability of residue pairs calculated by this method as the substitution score effectively improves alignment accuracy, especially on protein families with low homology.

(2) Current MSA algorithms use a fixed guide tree construction method, which cannot accurately reflect the relationships between protein sequences with different identities. In progressive alignment, earlier mismatches are always retained and affect the subsequent alignment steps, so the order of alignment is important. This dissertation proposes an adaptive guide tree construction method: for protein families with different identities, a corresponding hidden Markov model is used to construct the guide tree. The experimental results show that guide trees constructed by this method improve alignment accuracy, especially on protein families with lower identity.

(3) For protein families with long N-terminal or C-terminal extensions, realignment refinement methods based on horizontal division cannot exclude interference from the long flanking regions, and alignment accuracy is poor. Current MSA algorithms regroup or reorder the sequences during refinement and then realign them. These methods essentially divide the alignment horizontally: they take into account the similarity relationships between sequences but do not consider that regions of the alignment differ in their degree of conservation. This dissertation presents a realignment refinement method based on vertical division. The experimental results show that refining alignments with this method improves their accuracy, especially on protein families with long N-terminal or C-terminal extensions.

(4) Building on the above research into the key steps of progressive MSA, this dissertation proposes an integrated fusion MSA method. A corresponding model is adopted to construct the guide tree according to the identity of the protein family. On families with lower identity, the particle swarm optimized hidden Markov model and the partition function model are combined to calculate the posterior probability. On families with higher identity, the local and global hidden Markov models are used, respectively. The resulting alignments are then realigned using the method based on vertical division. The integrated fusion method is compared with many similar methods on three benchmark data sets. The experimental results show that it comprehensively improves the accuracy of sequence alignment and provides a more solid foundation for downstream biological analysis.
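
As a minimal illustration of the idea in (1), the Python sketch below combines two posterior probability matrices and uses the result as the substitution score in a pairwise dynamic program. It is a sketch under stated assumptions, not the dissertation's implementation: the function names `combine_posteriors` and `mea_align`, the root-mean-square combination rule, the gap-free scoring, and the toy matrices are all hypothetical choices made for illustration. The abstract does not specify how the two models are combined.

```python
from math import sqrt


def combine_posteriors(p_hmm, p_pf):
    """Combine two posterior matrices element-wise.

    ASSUMPTION: a root-mean-square rule is used here only as one plausible
    way to merge an HMM posterior with a partition-function posterior.
    """
    return [
        [sqrt((a * a + b * b) / 2.0) for a, b in zip(row_hmm, row_pf)]
        for row_hmm, row_pf in zip(p_hmm, p_pf)
    ]


def mea_align(post):
    """Pairwise alignment that maximizes the summed posterior of aligned pairs.

    post[i][j] is the posterior probability that residue i of sequence x is
    aligned to residue j of sequence y. Gaps score zero (an assumption).
    Returns (total_score, list of aligned (i, j) index pairs).
    """
    n, m = len(post), len(post[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + post[i - 1][j - 1],  # align i with j
                score[i - 1][j],                           # gap in y
                score[i][j - 1],                           # gap in x
            )

    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    # Toy posteriors for two sequences of length 2 (made-up numbers).
    p_hmm = [[0.9, 0.1], [0.1, 0.8]]
    p_pf = [[0.7, 0.1], [0.2, 0.6]]
    total, aligned = mea_align(combine_posteriors(p_hmm, p_pf))
    print(total, aligned)
```

In a progressive aligner, a routine like `mea_align` would be applied to sequence pairs and then to profiles in guide-tree order. The sketch covers only the scoring step and omits the model training, the guide tree, and the refinement stage.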
Keywords/Search Tags: multiple sequence alignment, hidden Markov model, particle swarm optimization, partition function, splitting vertically