
Machine Learning Methods And Their Applications In Bioinformatics

Posted on: 2011-08-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T G Liu
Full Text: PDF
GTID: 1100360332957023
Subject: Basic mathematics
Abstract/Summary:
With the development of genome sequencing technologies and of techniques for determining physical structure, we now face an explosive growth of biological sequence and structure data. Conventional biological experiments alone cannot characterize all of these data. This gap calls for fast and accurate solutions from bioinformatics. In bioinformatics, researchers try to discover profound biological knowledge by capturing, managing, storing, retrieving and analyzing biological data. From the perspective of information science, the study of bioinformatics is a process that leads from "data" to "discovery". Data mining based on machine learning plays an increasingly important role in bioinformatics and has yielded fruitful results. This dissertation studies machine learning methods and their applications in bioinformatics. The main results can be summarized as follows:

(1) In Chapter 2, we propose two methods to predict protein structural class. The first constructs a k-nearest neighbor classifier using a complexity-based distance measure. This method bypasses feature extraction and so avoids the information loss that feature extraction can cause. Tests on four benchmark datasets show the effectiveness of this method. The second extends the traditional concepts of amino acid composition and dipeptide composition from the primary sequence to the PSI-BLAST profile, that is, the position-specific scoring matrix (PSSM). A support vector machine (SVM) is used as the prediction engine. Tests on two low-similarity datasets show that this method is very promising for predicting protein structural class.

(2) In Chapter 3, we propose an improved pseudo amino acid composition approach to predict the subcellular location of apoptosis proteins. This approach extracts sequence features from PSSMs by the auto covariance transformation and applies an SVM to perform the prediction. Three widely used datasets are adopted to evaluate the performance of the approach. Results show that our approach achieves relatively high prediction accuracies in comparison with some classical methods.

(3) In Chapter 4, based on the ATTED-II database, we first construct an Arabidopsis gene co-expression network, then propose a subgraph-induced strategy and a graph-clustering approach based on maximum cliques to improve the clustering of co-expressed genes, and finally apply four classical motif-finding tools to predict transcription factor binding sites in each cluster of co-expressed genes. Results indicate that the proposed approach is effective and practical.

(4) In Chapter 5, we take the model plant Arabidopsis as the research object and apply an SVM to predict the regulatory relationships between transcription factors and target genes. The positive and negative feature vectors are composed of gene expression data. A jackknife cross-validation test on a dataset that we constructed shows that our method achieves high accuracy and may become a useful tool in related areas.
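As an illustration of the auto covariance transformation that the abstract applies to PSSMs, the following is a minimal sketch and not the dissertation's implementation. It assumes the commonly used definition, in which the feature for column j and lag g is the average over positions i of (P[i][j] - mean_j) * (P[i+g][j] - mean_j). The function name `auto_covariance`, the parameter `max_lag`, and the toy matrix are hypothetical; a real PSSM has 20 columns and may be normalized before the transformation.

```python
def auto_covariance(pssm, max_lag):
    """Turn an L x C score matrix into a fixed-length feature vector.

    pssm    : list of L rows (one per residue), each with C scores.
    max_lag : largest lag g to use (hypothetical parameter; must be < L).
    Returns a flat list of C * max_lag features, ordered column by column.
    """
    length = len(pssm)
    if length <= max_lag:
        raise ValueError("sequence length must exceed max_lag")
    n_cols = len(pssm[0])

    # Mean score of each column over all sequence positions.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]

    features = []
    for j in range(n_cols):
        for lag in range(1, max_lag + 1):
            # Average product of deviations for positions that are `lag` apart.
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy 4 x 2 matrix standing in for a real L x 20 PSSM (made-up values).
toy = [[1, 0], [3, 0], [5, 0], [7, 0]]
print(auto_covariance(toy, max_lag=2))
```

Because the output length depends only on the number of columns and on `max_lag`, proteins of different lengths map to vectors of the same size, which is what a classifier such as an SVM requires as input.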
Keywords/Search Tags: Bioinformatics, Machine learning, Protein structural class, Protein subcellular location, Transcription regulation