Font Size: a A A

Applications Of Machine Learning Approaches To Biological Sequence Analysis

Posted on:2014-01-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:J SongFull Text:PDF
GTID:1268330425996881Subject:Control theory and control engineering
Abstract/Summary:PDF Full Text Request
Bioinformatics is an emerging interdisciplinary research area that utilizes computational approaches to solve molecular biology problems. The ultimate goal of bioinformatics is to discover biological patterns and information hidden in the large amount of available biological data and use the information to improve the understanding of biological processes that are crucial for most organisms. Biological sequence analysis has become a fundamental task of bioinformatics. Recently, due to the explosive growth of the available biological data, machine learning methods have become increasingly important for analyzing biological sequence and extracting important information from it. Machine learning is an area that focuses on the development of algorithms that can capture complex patterns hidden in large amount of empirical data and make intelligent decisions based on them. The evolution of biological systems is determined by complex underlying mechanisms, machine learning approaches thus provide ideal tools to the exploration and analysis of them.In this work machine learning methods were used to analyse and process the biological sequence. The main contributions include:1、Computing the optimal alignment of multiple sequences is an NP-hard problem and a large number of approximate and heuristic approaches have been developed to efficiently find alignments that are close to the optimal one. However, most of the existing alignment tools use methods designed to optimize the score function associated with an alignment and thus can only generate one single alignment for a set of protein sequences. But it has been verified recently that the alignment which optimizes the score function may not be the one which is biologically desirable. In this dissertation, we develop a new machine learning approach that can efficiently and effectively explore the alignment space of multiple protein sequences and generate a list of alignments that have leading alignment scores. This new approach employs an ensemble of Hidden Markov Models (HMMs) to generate a list of alignments that have the leading alignment scores. The method progressively constructs and updates a set of alignments by adding sequences in a certain order to each of the existing alignments. Each of the existing alignments is modeled with a profile HMM and an added sequence is aligned to each of these profile HMMs. In addition, a double-sequence alignment method which can accurately calculate a list of alignments which have the leading scores in quadratic time was proposed and proved. To integrate the secondary structure information into the model, we used a simple score matrix to compute the secondary structure matching score and include the matching score in the overall alignment score. Our experiments show that the alignment accuracy can be further improved while taking into account structure information.2、In response to the shortcomings of low computational efficiency of the non-coding RNA sequence search software, a new model was developed to describe the secondary structure of a non-coding RNA molecule. Firstly, investigation of the members of non-coding RNA families shows that it is extremely unusual for the best fit of a true family member to the CM to deviate in length very much from the consensus length represented at any of the model states. According to this result, the secondary structure of RNA family was divided into a number of basic structural units, each of which is a stem or a loop. In particular, based on the length distribution of a structure unit in the secondary structure, the number of insertions and deletions that may occur in the structure unit during the evolution was restricted. Test results show that the computation time needed to align a sequence to this new model can be significantly reduced. The speed up that can be achieved by our model is significant for sequences that contain a large number of nucleotides.3、Transcription factor binding sites are important for gene regulation and accurately predicting their locations in the promoter regions is crucial for understanding the regulation of the expression levels of particular genes. Identifying transcription factor binding sites with experimental methods is often expensive and time-consuming. Although many computational approaches and tools have been developed for this problem, the prediction accuracy is not satisfactory. In this paper, we develop a new computational approach that can model the relationships among all short sequence segments in the promoter regions with a graph theoretic model. Based on this model, finding the locations of transcription factor binding site is reduced to computing maximum weighted cliques in a graph with weighted edges. Meanwhile, in order to improve the speed of looking for the optimal solution, we proposed a pretreatment method which can significantly reduce the size of the graph. Finally, we develop an enumeration algorithm that can efficiently search the graph model and find the maximum edge-weight clique, which represent the predicted transcription factor bind sites.4、A large number of methods have been developed for the clustering of DNA microarray data. One problem with these methods is that classes of functionally related genes are sometimes not well defined, while most of these methods can generate only one clustering result. In this dissertation, we develop a new algorithm that can cluster DNA microarray data with a graph based algorithm. In particular, a DNA microarray data set is represented by a graph where edges are weighted. We then apply an algorithm that can compute the minimum weighted and second minimum weighted graph cuts respectively to the graph. The graph can then be partitioned into two subgraphs by each graph cut. We then apply the algorithm recursively on these two subgraphs until no additional clusters can be found. Our contribution is an original polynomial time algorithm that can compute both the minimum weighted and second minimum weighted graph cuts in a weighted graph and an approach that can maintain, update and output a list of clustering results with significant likelihood. Finally, the definition of high connectivity graph was extended to be used in undirected weighted graph, and was taken as termination criteria for graph partitioning.
Keywords/Search Tags:Multiple Sequence Alignment, DNA microarray data, Hidden Markov Models, Clustering algorithm, Bioinformatics, Transcription Factor Binding Sites, Ensemble Learning, Covariance Model, Stochastic Context-free Grammars Model, Noncoding RNA
PDF Full Text Request
Related items