Machine Learning Methods And Their Applications In Bioinformatics

Posted on: 2010-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Q Wang
Full Text: PDF
GTID: 1118360272496719
Subject: Computer application technology
Abstract/Summary:
Bioinformatics is an interdisciplinary subject that emerged with the start of the Human Genome Project at the end of the 1980s. It is one of the great frontiers of the life sciences and natural sciences, and it will be one of the core fields of natural science in the 21st century. It draws on several disciplines, including biology, computer science and applied mathematics. Bioinformatics research includes biological data collection and management, database search and sequence alignment, genome sequence analysis, gene expression data analysis and processing, protein structure prediction, and the construction of metabolic pathways, signaling pathways and gene regulatory networks.

Bioinformatics methods can be used to process large-scale data and extract the necessary information, so that we can better understand and reveal the mysteries of living systems. With the completion of the genome sequencing projects, the amount of data to analyze and explain is increasing exponentially. This volume of data, together with the complexity of genome data itself, urgently calls for the development of new theories, algorithms and software. Machine learning methods such as neural networks, genetic algorithms, decision trees and support vector machines are well suited to fields in which there is a large amount of noisy data and no unified theory.

In this thesis, we study machine learning methods and their applications in bioinformatics. The main work covers the following four aspects:

1. We present a new approach for inducing decision trees based on the Variable Precision Rough Set Model (VPRSM). Decision tree classification is popular in machine learning. Current methods of constructing decision trees are based on purity measures such as information entropy and the Gini index.
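
For reference, the two purity measures named above can be written in a few lines. This is a minimal sketch of the standard textbook definitions, not code from the thesis; the function names `entropy` and `gini` are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class labels reaching a tree node."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of the class labels reaching a tree node."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


# A node with an even two-class split is maximally impure for two classes.
mixed = ["operon", "operon", "non-operon", "non-operon"]
print(entropy(mixed))  # 1.0
print(gini(mixed))     # 0.5
```

Both measures are zero for a node that contains a single class and grow as the classes become more evenly mixed.
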
From the rough set theory point of view, the common characteristic of these methods is that they consider only the information of the implicit region, without considering the information of the explicit region. Correspondingly, rough set based approaches for inducing decision trees consider the information of the explicit region: the more certain the information is, the better the results are. In real applications, however, data always contain noise. Rough set based methods partition the samples exactly, so they cannot prevent noise from affecting the construction of the decision tree. To reduce the classifier's sensitivity to noisy data and improve its generalization ability, we introduce variable precision rough set theory into the construction of decision tree classifiers and propose an approach for inducing decision trees based on VPRSM. We propose two main concepts, the variable precision explicit region and the variable precision implicit region, and give the algorithm for inducing decision trees based on the model. A comparison between the presented approach and C4.5 on several data sets from the UCI Machine Learning Repository is also reported. Experimental results show that the proposed approach is superior to the classical decision tree algorithm C4.5, especially before pruning.

2. A novel multi-approach guided genetic algorithm for operon prediction is presented. Because the fuzzy rules used in Jacob's approach are defined intuitively, it is difficult for non-specialists to create them. Moreover, that approach uses the same method to assess every type of genome data, so it cannot exploit the biological characteristics of each type.
We therefore preprocess different genome features with different methods to bring out their unique characteristics, and use intergenic distance, participation in the same metabolic pathway, COG gene functions and microarray expression data to predict operons. A novel local-entropy-minimization (LEM) method is proposed to partition and evaluate intergenic distances: LEM divides the intergenic distances into several intervals and assigns a score to each interval. A COG function log-likelihood is computed for each adjacent gene pair, and the correlation coefficient of the microarray expression values is calculated. Finally, a genetic algorithm is used to fuse these four genome features and predict operons. The proposed method is evaluated on the Escherichia coli K12, Bacillus subtilis and Pseudomonas aeruginosa PAO1 genomes, yielding prediction accuracies of 85.9987%, 88.296% and 81.2384%, respectively. Experimental results demonstrate that prediction using multiple features performs better than prediction using a single feature. They also show that it is possible to use the intergenic distance intervals obtained by LEM in Escherichia coli for operon prediction in other prokaryotic genomes.

3. We present an operon prediction method that uses a decision tree classifier based on the Variable Precision Rough Set. In addition to the intergenic distance, COG gene functions, metabolic pathway and microarray expression data used in the 4th chapter, we add two genome features: phylogenetic profiles and conserved gene pairs. We describe how to extract both. First, we use 360 genomes and the BLAST program to compute the phylogenetic profile of each gene and the conserved gene pairs for each gene pair. Then the Hamming distances between the phylogenetic profiles of adjacent gene pairs are computed.
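
The Hamming distance step can be illustrated as follows. This is a minimal sketch that assumes each phylogenetic profile is a binary presence/absence vector over the reference genomes, with genes listed in chromosomal order; the function names and the 6-genome toy profiles are hypothetical (the thesis uses 360 genomes).

```python
def hamming_distance(profile_a, profile_b):
    """Number of reference genomes in which two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def adjacent_profile_distances(profiles):
    """Hamming distance for every pair of neighbouring genes.

    `profiles` is assumed to be ordered by gene position on the genome.
    """
    return [hamming_distance(p, q) for p, q in zip(profiles, profiles[1:])]


# Toy profiles for three consecutive genes over six reference genomes
# (1 = a homolog was found in that genome, 0 = none found).
profiles = [
    [1, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
print(adjacent_profile_distances(profiles))  # [1, 5]
```

A small distance means the two neighbouring genes tend to be present or absent in the same genomes.
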
We give the frequency distribution and log-likelihoods for different distances between phylogenetic profiles. Finally, we take these six genome features as the input of the proposed method. The method is evaluated on Escherichia coli K12, Bacillus subtilis and Pseudomonas aeruginosa PAO1, and is compared with C4.5. Experimental results show that it is an effective method for operon prediction.

4. An entropy-based improved k-TSP method (Ik-TSP) for classifying cancer is proposed. The k-TSP method proposed by Aik Choon Tan uses the top k highest-scoring gene pairs as the decision rule instead of only the single highest-scoring pair. The method therefore needs to calculate the score of every gene pair and determine the decision rule from the scores of all pairs. Because each cancer dataset is very large (the datasets used in this work contain at least 2,000 genes), the algorithm has relatively high time and space complexity. We therefore propose an entropy-based improved k-TSP method: we use information entropy to select key genes, and then use the k-TSP method to predict cancer classes. To evaluate the classification performance of Ik-TSP, we use the 9 binary gene expression datasets used by Aik Choon Tan as our experimental datasets. Leave-one-out cross-validation (LOOCV) is employed to estimate prediction accuracy. Compared with seven other existing machine learning methods, Ik-TSP obtains an average accuracy of 95.44%, an improvement of 3% over the k-TSP method.

We have obtained some results on operon prediction and cancer prediction. This work enriches the study of applied machine learning theory and provides a theoretical basis for applications in operon prediction and cancer prediction. Operon prediction provides valuable information for the reconstruction of regulatory networks and for drug design.
Cancer prediction provides a new method for finding gene markers and can promote the early diagnosis and treatment of cancer.
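
To make the cost of the pairwise scoring in the fourth contribution concrete, the sketch below scores every gene pair by the commonly cited top-scoring-pair criterion, |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|. It is an illustration under assumptions: the abstract does not state the exact score, tie handling or pair selection rules used in the thesis, the entropy-based gene selection step is not shown, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def pair_score(expr, labels, i, j):
    """Absolute difference between the two classes in how often gene i < gene j."""
    freqs = []
    for cls in (0, 1):
        samples = [row for row, y in zip(expr, labels) if y == cls]
        freqs.append(sum(row[i] < row[j] for row in samples) / len(samples))
    return abs(freqs[0] - freqs[1])


def top_scoring_pairs(expr, labels, k):
    """Score all gene pairs exhaustively and return the k best as (score, (i, j))."""
    n_genes = len(expr[0])
    scored = [
        (pair_score(expr, labels, i, j), (i, j))
        for i, j in combinations(range(n_genes), 2)
    ]
    scored.sort(key=lambda item: -item[0])  # stable sort, highest score first
    return scored[:k]


# Toy expression matrix: rows are samples, columns are genes.
expr = [
    [1.0, 2.0, 5.0],  # class 0
    [1.5, 3.0, 4.0],  # class 0
    [3.0, 1.0, 4.5],  # class 1
    [2.5, 0.5, 5.5],  # class 1
]
labels = [0, 0, 1, 1]
print(top_scoring_pairs(expr, labels, 1))  # [(1.0, (0, 1))]
```

The exhaustive loop visits n(n-1)/2 pairs, which is 1,999,000 pairs for 2,000 genes, so reducing the number of genes before scoring shrinks the work quadratically.
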
Keywords/Search Tags: bioinformatics, machine learning, operon prediction, classifying cancers, decision tree, genetic algorithm, variable precision rough set, variable precision explicit region, variable precision implicit region, intergenic distance, COG gene functions