Research on Data Mining Modeling Theories and Their Applications in Bioinformatics

Posted on: 2007-12-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H B Shen
GTID: 1118360305956424
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
In the past decades, the rapid development of science, the economy, and society has produced large amounts of data. How to extract valuable knowledge and rules from these data is a critical problem. Data mining, which combines statistics, database technology, machine learning, and related techniques, was proposed to address it.

Clustering analysis is one of the most important research areas in data mining. Real-world datasets are often high-dimensional, and in most cases different attributes contribute differently to each cluster. To address this, an attribute-weighted fuzzy kernel clustering algorithm is proposed. The algorithm reflects the importance of each attribute for each cluster and can therefore yield much higher clustering accuracy than conventional clustering algorithms. Another situation often encountered in practice is that a dataset is independent of other datasets yet also cooperates with them. Based on such cooperative constraints, a new information-based collaborative clustering algorithm is proposed. It takes the influence of the other datasets into account, which makes the clustering results more flexible. Eyes are the main organs humans use to group objects and to find the important inherent relations among them, so designing clustering methods that simulate the visual system can help solve some basic problems of conventional clustering algorithms. By simulating the uneven sampling mechanism of the human eye, a new visual clustering algorithm is proposed, which offers new ideas for clustering analysis research.

With the rapid development of biological science, we now face an explosion of biological data. It is impossible to characterize all of these data with conventional biological experiments alone. This gap calls for fast and accurate solutions from bioinformatics.
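To make the attribute-weighting idea from the clustering discussion above concrete, here is a minimal sketch of a fuzzy c-means variant in which every cluster learns its own attribute weights. It is an illustration under stated assumptions, not the dissertation's algorithm: it works in the input space rather than a kernel space, it uses a common textbook-style update rule for the weights, and the names `weighted_fcm` and `farthest_first` and the parameters `m`, `beta`, `n_iter`, and `eps` are hypothetical.

```python
def farthest_first(data, n_clusters):
    """Deterministic seeding (an assumption of this sketch): start at data[0],
    then repeatedly take the point farthest from the centres chosen so far."""
    centers = [list(data[0])]
    while len(centers) < n_clusters:
        def gap(x):
            return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
        centers.append(list(max(data, key=gap)))
    return centers


def weighted_fcm(data, n_clusters, m=2.0, beta=2.0, n_iter=30, eps=1e-9):
    """Fuzzy c-means with one attribute-weight vector per cluster.

    Returns (centers, weights, u) where weights[i][j] is the weight of
    attribute j in cluster i and u[i][k] is the membership of point k in
    cluster i. Requires m > 1 and beta > 1.
    """
    n, d = len(data), len(data[0])
    centers = farthest_first(data, n_clusters)
    weights = [[1.0 / d] * d for _ in range(n_clusters)]
    u = []
    for _ in range(n_iter):
        # Attribute-weighted squared distance of every point to every centre.
        dist = [[sum(weights[i][j] ** beta * (x[j] - centers[i][j]) ** 2
                     for j in range(d)) + eps
                 for x in data] for i in range(n_clusters)]
        # Membership update: each point's memberships sum to 1.
        u = [[0.0] * n for _ in range(n_clusters)]
        for k in range(n):
            inv = [dist[i][k] ** (-1.0 / (m - 1.0)) for i in range(n_clusters)]
            total = sum(inv)
            for i in range(n_clusters):
                u[i][k] = inv[i] / total
        for i in range(n_clusters):
            um = [u[i][k] ** m for k in range(n)]
            mass = sum(um)
            # Centre update: membership-weighted mean of the data.
            centers[i] = [sum(um[k] * data[k][j] for k in range(n)) / mass
                          for j in range(d)]
            # Weight update: attributes with a small within-cluster spread
            # receive a large weight; each cluster's weights sum to 1.
            spread = [sum(um[k] * (data[k][j] - centers[i][j]) ** 2
                          for k in range(n)) + eps for j in range(d)]
            inv = [s ** (-1.0 / (beta - 1.0)) for s in spread]
            total = sum(inv)
            weights[i] = [v / total for v in inv]
    return centers, weights, u


if __name__ == "__main__":
    # Toy data: attribute 0 separates the two groups, attribute 1 is noise.
    toy = [[0.1 * t, -2.5 + t] for t in range(6)] + \
          [[10 + 0.1 * t, -2.5 + t] for t in range(6)]
    _, w, _ = weighted_fcm(toy, 2)
    print(w)
```

On data where one attribute carries the cluster structure and another is noise, the learned weights should concentrate on the informative attribute. Results depend on the seeding, which is why this sketch uses a deterministic farthest-first start.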
Bioinformatics is a new and active research area that combines computer science and biological science. Many data mining techniques, such as clustering analysis, have been used to analyze biological data. We propose using a "supervised" clustering algorithm to predict protein structures, and it is shown to be a better choice than the "unsupervised" approach because it incorporates the class label information of the training dataset. In proteomics research, one important step is to represent a protein sequence as a discrete model. The so-called pseudo amino acid composition (PseAA) has been shown to be more effective than the conventional 20-D amino acid (AA) composition because it includes more sequence-order information. However, choosing the dimension of PseAA is a critical problem, and in past research it was determined by trial and error. The resulting choice is unstable: different applications and algorithms lead to different selections. We propose using an ensemble classifier to solve this problem. The ensemble classifier fuses many independent classifiers, each working in a different dimension. Experiments show that much higher prediction accuracy is obtained in most cases, because the ensemble classifier captures the core features of the sequences from different perspectives. It is therefore an effective and flexible way to solve the dimension selection problem in PseAA composition.

In the past few years, with the development of life science, the Gene Ontology (GO) database was constructed and has become increasingly important in life science research. Based on the GO discrete model, we have developed several predictors of protein subcellular location for human, plant, bacterial, and other proteins. Various experiments have demonstrated that the GO discrete model is a higher-level discrete model, and better prediction accuracy is observed accordingly.
Furthermore, the benchmark datasets constructed here cover the most subcellular locations reported in the literature to date, which greatly improves the practical value of the developed predictors. To make our bioinformatics research results easy to use, several online prediction web servers have been constructed and can be accessed through the Internet. To date, many biologists from the USA, England, Holland, Australia, China, and other countries have visited and used our web servers. We believe that such easy-to-use web servers will greatly promote the development of biological science.

The new contributions of this dissertation are as follows:
(1) A new attribute-weighted fuzzy kernel clustering algorithm is proposed, and its convergence is proved theoretically;
(2) A collaborative clustering model, which is a very important model in real-world settings, is discussed;
(3) By simulating the mechanism of the human eye, a new visual clustering algorithm is proposed;
(4) A "supervised" clustering algorithm is proposed to predict protein structures;
(5) An ensemble classifier is proposed for proteomics research, which effectively solves the dimension selection problem in the PseAA discrete model. Various experiments demonstrate the feasibility of the ensemble classifier;
(6) Several predictors based on the GO discrete model are developed to predict protein subcellular locations in human, plant, bacterial, and other cells. The stringent benchmark datasets constructed in this dissertation cover the most subcellular locations reported in the literature to date;
(7) Easy-to-use online bioinformatics web servers are constructed, which can greatly promote the development of life science.
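The PseAA dimension problem and the ensemble idea described in the abstract can be sketched as follows. This is a simplified illustration, not the dissertation's predictor: it builds a (20 + λ)-dimensional PseAA vector from a single physicochemical property (the Kyte-Doolittle hydropathy scale, chosen here as a stand-in), uses a nearest-neighbour base classifier, and fuses the per-dimension predictions by majority vote. The function names, the weight `w`, the choice of `lams`, and the toy labels are all assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used as the only property in this sketch.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def _standardised(prop):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    vals = [prop[a] for a in AMINO_ACIDS]
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return {a: (prop[a] - mean) / std for a in AMINO_ACIDS}


H = _standardised(KD)


def pse_aa(seq, lam, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector.

    The first 20 components are residue frequencies; the last lam components
    are sequence-order correlation factors for gaps 1..lam.
    """
    residues = [a for a in seq.upper() if a in H]
    if lam >= len(residues):
        raise ValueError("lam must be smaller than the sequence length")
    freq = [residues.count(a) / len(residues) for a in AMINO_ACIDS]
    theta = []
    for k in range(1, lam + 1):
        pairs = len(residues) - k
        theta.append(sum((H[residues[i]] - H[residues[i + k]]) ** 2
                         for i in range(pairs)) / pairs)
    denom = sum(freq) + w * sum(theta)
    return [f / denom for f in freq] + [w * t / denom for t in theta]


def nearest_label(train, query_vec):
    """1-nearest-neighbour base classifier on (vector, label) pairs."""
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query_vec))
    return min(train, key=sq_dist)[1]


def ensemble_predict(train_seqs, query_seq, lams=(1, 2, 3, 4), w=0.05):
    """Run one base classifier per PseAA dimension and fuse by majority vote.

    Ties go to the label that was voted for first.
    """
    votes = []
    for lam in lams:
        train = [(pse_aa(s, lam, w), y) for s, y in train_seqs]
        votes.append(nearest_label(train, pse_aa(query_seq, lam, w)))
    return Counter(votes).most_common(1)[0][0]
```

Voting over several values of λ avoids committing to a single dimension chosen by trial and error, which is the motivation the abstract gives for the ensemble classifier. A call such as `ensemble_predict([("LLIVVALLIVAL", "membrane"), ("DEKRDEKRKDER", "soluble")], "LIVALLVIALIV")` shows the intended usage; the sequences and labels are toy values.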
Keywords: Data mining, Clustering analysis, Bioinformatics, Machine learning, Fuzzy C-Means, Information theory, Sampling theory, Evidence theory, Ensemble classifier, Protein structure prediction, Membrane protein type recognition, Cellular network