
Research On Biological Sequence Classification Based On Machine Learning Methods

Posted on: 2010-01-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Yang
GTID: 1118360302466636
Subject: Computer software and theory
Abstract/Summary:
Over the past few decades, machine learning methods have developed rapidly in bioinformatics and have become an important means of solving biological problems. In bioinformatics, gene recognition, functional site and signal recognition on DNA sequences, and protein sequence feature analysis all require machine learning and pattern recognition techniques. In this thesis, we focus on two key problems in pattern recognition, namely feature extraction and pattern classification, to analyze and classify biological sequences, including protein and DNA sequences. We address a series of biological problems: protein subcellular localization, protein homology searching, prediction of the proteins secreted by the type III secretion system, and prediction of novel non-coding RNAs.

The major contributions of the thesis are:

1) Inspired by the word segmentation techniques used in Chinese natural language processing, we proposed a new protein sequence feature extraction method. We selected statistically significant subsequences from the protein sequences, segmented the amino acid sequences into non-overlapping words, and extracted the features of the protein sequences by counting the frequency of each word. Compared with the traditional amino acid k-mer frequency method, the proposed method has lower dimensionality and higher accuracy. We applied it to protein subcellular localization and protein family classification and obtained good results.

2) Considering the low sequence similarity and unstable structures of the proteins secreted by type III secretion systems, i.e., effectors, we used, for the first time, protein secondary structure, solvent accessibility, and amino acid composition information to predict unknown effectors. We performed cross-validation on the Pseudomonas genome and obtained high accuracy. Moreover, we predicted all the effectors of four strains of Rhizobium. By combining these predictions with promoter pattern matching, we obtained a number of new type III secretion effectors.

3) To handle the class imbalance and multi-localization problems in protein subcellular localization, we used min-max modular support vector machines to solve the multi-label imbalance problem. Compared with traditional support vector machines, the modular classifier improved both total accuracy and class average accuracy. This method also greatly reduced training time, which makes it suitable for large-scale data sets.

4) We proposed a new task decomposition method for min-max modular support vector machines based on biological domain knowledge, namely taxonomy and Gene Ontology information. The new decomposition method has more stable performance and higher accuracy than random decomposition and other decomposition methods.

5) Using a comparative genomics approach, we extracted intergenic regions from multiple plant genome sequences and obtained conserved sequence segments through sequence alignment. We ran predictions on these segments, applied a series of screening steps, and finally obtained 21 new non-coding RNAs, which can be grouped into 16 families. The expression of these new ncRNAs has been verified through wet-bench experiments.
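The word-based feature extraction described in contribution 1 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how the statistically significant subsequences are selected or which segmentation rule is used. The sketch assumes a word vocabulary is already given and uses greedy forward maximum matching, a common baseline in Chinese word segmentation. The names `segment` and `word_frequency_features`, and the toy vocabulary, are hypothetical.

```python
from collections import Counter


def segment(sequence, vocabulary, max_len):
    """Split a sequence into non-overlapping words by greedy forward
    maximum matching (an assumed rule, not necessarily the thesis's)."""
    words = []
    i = 0
    while i < len(sequence):
        # Try the longest candidate first; a single residue always matches.
        for length in range(min(max_len, len(sequence) - i), 0, -1):
            candidate = sequence[i:i + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


def word_frequency_features(sequence, vocabulary, feature_words):
    """Return the relative frequency of each feature word in the segmentation."""
    max_len = max((len(w) for w in vocabulary), default=1)
    words = segment(sequence, vocabulary, max_len)
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in feature_words]


# Toy example: the vocabulary below is made up for illustration only.
toy_vocabulary = {"AL", "KKL", "GG"}
print(segment("ALKKLGGA", toy_vocabulary, 3))  # ['AL', 'KKL', 'GG', 'A']
print(word_frequency_features("ALKKLGGA", toy_vocabulary, ["AL", "GG", "KKL", "A"]))
```

The feature vector has one entry per word in `feature_words`, so its dimensionality depends on the size of the selected vocabulary and not on the number of possible k-mers, which is consistent with the lower dimensionality claimed above.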
Keywords/Search Tags: Bioinformatics, Biological Sequence, Feature Extraction, Pattern Classification, Min-Max Modular Network, Support Vector Machines, Task Decomposition, Protein Subcellular Localization, Non-Coding RNA