Research On Application Of Machine Learning And Data Mining In Bioinformatics

Posted on: 2012-01-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Du
GTID: 1118330332499418
Subject: Computer Science and Technology
Abstract/Summary:
Bioinformatics is an interdisciplinary science that applies statistical theory and computer science to molecular biology. Bioinformatics researchers use a variety of statistical and computational tools to analyze large-scale, high-throughput experimental data, and these analyses help to interpret biological experiments and to understand the underlying biological processes. Machine learning and data mining are important areas of computational science that have developed rapidly. In recent years they have been applied with great success to bioinformatics problems such as the analysis of large-scale, high-throughput sequencing data and microarray data. However, for some bioinformatics problems, machine learning and data mining techniques still need to be improved, especially in genome annotation, computational evolutionary biology and gene expression analysis, which are important areas of bioinformatics research. In this dissertation, we apply machine learning and data mining methods to three problems: operon prediction (genome annotation), phylogenetic tree construction (computational evolutionary biology) and feature selection for microarray data (gene expression data analysis). The main contributions are as follows:

1) A new operon prediction method based on a neural network is proposed. The method predicts operons with a generalized regression neural network (GRNN) from four types of genomic data, each represented by log-likelihood values that are optimized by a wavelet transform. We first estimate the log-likelihood value distributions for intergenic distances, COG gene functions, conserved gene pairs and phylogenetic profiles, and then optimize these estimates using the wavelet transform.
Finally, the values are input into the GRNN, which integrates the four types of data to predict the operon structure of the genome. The method achieves average sensitivity, specificity and accuracy of 88.6%, 89.2% and 88.9% for E. coli K12, and 87.4%, 85.5% and 86.3% for B. subtilis str. 168. Because the genomic information is obtained by computation and does not depend on experimental data, the method can be applied flexibly to newly sequenced genomes.

2) A new operon prediction model based on graph clustering with Markov Clustering, called OPMC, is proposed. Advances in sequencing make it possible, in principle, to predict operons for newly sequenced species. However, few existing studies provide successful implementations for such organisms. Because they rely on genome-specific features or on an overfitted classifier, most current methods generalize poorly. OPMC works without a classifier and exploits several features. The clustering operates on four types of genomic features: intergenic distances, conserved gene clusters, gene ontology (GO) similarity and minimum free energy. Unlike earlier prediction models and approaches, we map gene clusters, rather than gene pairs, to actual operons. Because genes within an operon share the same transcription direction and have much shorter intergenic distances, we first consider genes on the same strand and generate candidate gene clusters from intergenic distances. Secondly, the four features of the genes in each candidate cluster are computed and scored by log-likelihood. Finally, the predicted operons are obtained from the candidate gene clusters using Markov Clustering. As in most operon prediction studies, we use the genomes of E.
coli K12 and Bacillus subtilis to assess the method in single-species validation, and obtain average sensitivity, specificity and accuracy of 92.9%, 90.2% and 91.7% for E. coli K12, and 89.9%, 88.4% and 89.1% for Bacillus subtilis. The experimental results show that the proposed method predicts operons better than several other popular operon prediction programs, and that it performs well not only in single-species validation but also on new species.

3) A novel method for inferring prokaryotic phylogenies from orthologous genes within whole genomes is proposed. The evolutionary distance between two genomes is measured by the number of contiguous orthologous genes. Firstly, we identify orthologous genes using COGs (Clusters of Orthologous Groups). Secondly, the distance matrix is calculated from the number of contiguous orthologous genes. Finally, the Neighbor-Joining (NJ) method is applied to this distance matrix to construct the phylogenetic tree. To facilitate comparison of results, the method is evaluated on both fixed and random datasets drawn from 398 prokaryotic genomes. On average, more than 90% of the quartet topologies on these datasets agree with Bergey's taxonomy. Simulation results show that the proposed method is effective for phylogenetic analysis.

4) A novel method for inferring prokaryotic phylogenies from a whole-genome distance matrix of orthologous gene clusters, built from multiple kinds of genomic information, is proposed. Most existing methods use only one kind of genomic information, such as sequence similarity, genomic function or genomic structure. Moreover, they do not account for horizontal gene transfer (HGT) events or paralogous genes, and therefore place some species incorrectly on the phylogenetic tree.
Our method takes into account genes that may have been involved in horizontal gene transfer (HGT) events. The distance between two genomes is calculated from the number of orthologous genes in conserved clusters. Firstly, we identify orthologous genes by considering sequence similarity, genomic function and genomic structure. Secondly, genes potentially involved in HGT events are eliminated according to a set of rules. The distance matrix is then calculated from the number of orthologous genes in conserved clusters. Finally, the Neighbor-Joining (NJ) method is applied to this distance matrix to construct the phylogenetic tree. The method is evaluated on different datasets drawn from 617 prokaryotic genomes. On average, more than 93% of the quartet topologies on these datasets agree with Bergey's taxonomy. Simulation results show that the proposed method performs consistently better than other existing methods across datasets, so it is effective for phylogenetic analysis.

5) An improved global normalized signal-to-noise ratio (gn-SNR) method for removing irrelevant genes is proposed. Most genes in the original microarray datasets are irrelevant, so eliminating them is an important stage of feature selection for microarray expression data analysis. The method eliminates irrelevant genes by considering the mean and standard deviation, with global normalization, between the different classes of samples. Firstly, the contribution of the mean between the two classes of samples is measured as in the original SNR method. Secondly, the globally normalized contributions of the mean and standard deviation between the two classes are computed. Finally, a threshold on these contributions is used to remove irrelevant genes.
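The filtering step described above can be illustrated with a minimal sketch. The abstract does not give the gn-SNR formula, so the code below shows only the commonly used baseline signal-to-noise score, the absolute difference of the class means divided by the sum of the class standard deviations, followed by a threshold. The function names `snr_scores` and `filter_genes`, the data layout and the threshold value are assumptions for illustration, not the author's implementation, and the global normalization is not reproduced.

```python
from statistics import mean, pstdev


def snr_scores(expr, labels):
    """Return one baseline signal-to-noise score per gene.

    expr   : list of samples, each a list of expression values (one per gene)
    labels : list of 0/1 class labels, one per sample (both classes present)
    """
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        class_a = [row[g] for row, y in zip(expr, labels) if y == 0]
        class_b = [row[g] for row, y in zip(expr, labels) if y == 1]
        diff = abs(mean(class_a) - mean(class_b))
        spread = pstdev(class_a) + pstdev(class_b)
        if spread > 0:
            scores.append(diff / spread)
        else:
            # No within-class variation: perfectly separating or constant gene.
            scores.append(float("inf") if diff > 0 else 0.0)
    return scores


def filter_genes(expr, labels, threshold):
    """Keep the indices of genes whose score reaches the (hypothetical) threshold."""
    return [g for g, s in enumerate(snr_scores(expr, labels)) if s >= threshold]
```

A gene whose class means differ by much more than its within-class spread receives a high score and is kept; genes with similar means or large spread in both classes fall below the threshold and are dropped as irrelevant.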
We evaluate the method on the Leukemia, Prostate and Colon microarray datasets. The best accuracies on these datasets are 96.07%, 91.56% and 94.50%, and the mean accuracies are 92.38%, 86.76% and 88.54%, respectively. The experimental results show that the proposed gn-SNR method is effective and robust at eliminating irrelevant genes for microarray expression data analysis.

6) A novel multi-step feature selection method that considers all kinds of genes in the original gene set is proposed. Although most existing methods eliminate most irrelevant genes effectively, they rarely consider redundant and noisy genes. Firstly, an improved normalized signal-to-noise ratio (SNR) algorithm is used to remove obviously irrelevant genes. Secondly, an improved support vector clustering algorithm based on k-means (SVC-KM) is applied to eliminate noisy and redundant genes. After obtaining the gene clusters, we rank all clusters with a recursive cluster elimination method and eliminate the low-ranking clusters. The genes in each remaining cluster are then ranked by a special SVM-RFE that uses the cluster ranking, and the low-ranking genes in a cluster are treated as redundant. Finally, a standard SVM-RFE is used to select the final informative genes, yielding an approximately optimal feature gene subset. We evaluate the method on the Leukemia, Prostate, Colon, Breast, Nervous and DLBCL microarray datasets. With an SVM classifier and the top 60 genes, the best accuracies on these datasets are 100%, 98.04%, 100%, 89.74%, 100% and 98.28%, and the mean accuracies are 97.59%, 93.45%, 97.16%, 82.84%, 93.17% and 88.74%, respectively.
The experimental results show that the proposed method is effective for feature selection in microarray expression data analysis.

7) A novel feature selection method is proposed that applies a backward elimination procedure on a local support vector machine (Local SVM) to rank informative genes after redundant genes have been eliminated by clustering. The method, which we call cluster and local SVM-RFE (CL-SVM-RFE), eliminates redundant genes with an improved support vector clustering algorithm. Firstly, to reduce the computational complexity, we apply a filter method, the improved normalized signal-to-noise ratio (InSNR), to eliminate obviously irrelevant genes. Secondly, an improved support vector clustering algorithm based on k-means (SVC-KM) is used to eliminate redundant genes. After obtaining the gene clusters, a recursive cluster elimination method is applied to rank the clusters. The redundant genes in each cluster are then eliminated by a special SVM-RFE that uses the cluster ranking. Finally, we develop a backward elimination procedure on Local SVM to rank the final informative genes. Local SVM is a local learning algorithm (LLA) that has previously been applied to remote sensing and visual recognition tasks. We evaluate the method on the Leukemia, Prostate, Colon, Breast, Nervous and DLBCL microarray datasets. With an SVM classifier and the top 60 genes, the best accuracies on these datasets are 100%, 98.04%, 100%, 89.74%, 100% and 98.28%, and the mean accuracies are 97.59%, 93.45%, 97.16%, 82.84%, 93.17% and 88.74%, respectively. The experimental results show that the proposed method is effective for feature selection in microarray expression data analysis.

In summary, we apply machine learning and data mining techniques to several problems in bioinformatics and propose new methods to solve them.
Experimental results show that the proposed methods are effective on the problems addressed. Applying machine learning and data mining methods to biological problems is therefore both effective and feasible.
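Contribution 2 relies on Markov Clustering to split candidate gene clusters into predicted operons. As a rough illustration of that clustering step only, the following is a minimal sketch of the generic Markov Clustering procedure (alternating expansion and inflation of a column-stochastic matrix). It is not OPMC itself: the abstract does not say how the gene graph is built, so the edge weights, the function names and the parameter defaults here are all hypothetical.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power, then renormalize the columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(weights, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster the nodes of a weighted undirected graph.

    weights: symmetric square matrix of non-negative edge weights
             (hypothetical gene-to-gene similarity scores).
    Returns a list of clusters, each a list of node indices.
    """
    n = len(weights)
    # Add self-loops so every column has mass before normalization.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep mass act as attractors; their non-zero columns form a cluster.
    clusters = []
    for i in range(n):
        members = [j for j in range(n) if m[i][j] > eps]
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Example: genes 0-2 and genes 3-4 are tightly linked, with one weak link (2-3).
example = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(markov_cluster(example))
```

Inflation strengthens strong connections and suppresses weak ones, which is why the procedure needs no trained classifier; in a setting like the one described, the edge weights would presumably be derived from the log-likelihood feature scores.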
Keywords/Search Tags: Machine Learning, Data Mining, Operon Prediction, Phylogenetic Tree Construction, Microarray Data, Feature Selection