
Classification Methods In Data Mining And Their Applications To Mass Spectral Data

Posted on: 2006-08-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P He
Full Text: PDF
GTID: 1100360155463757
Subject: Probability theory and mathematical statistics
Abstract/Summary:
An important goal of data mining in chemistry is to extract useful information from databases and then classify and recognize compounds or medicines by their molecular structure, topological indices, or chemical fingerprints. With the growth of chemical measurement and modern information technology, increasingly large databases of chemical compound information have been established, such as spectral databases, chromatographic databases, and databases of molecular structures and their substance properties. Discovering the knowledge hidden in such huge collections is a major challenge. Our work concerns the methodology and application of classification methods for very large data sets. The classification methods introduced and proposed in this thesis can, in general, be applied to a variety of classification problems. Here, we focus on classification methods and their applications in the analysis of mass spectra. Mass spectrometry, an instrumental technique used to characterize and identify chemical compounds, produces large amounts of valuable data for chemical structure elucidation. Identifying compounds, or automatically recognizing structural properties, from mass spectral (MS) data is an important task in chemometrics. In this thesis, we first introduce different classification methods based on classical multivariate data analysis, artificial intelligence, or modern data mining techniques. These methods have been applied with some success to the automatic recognition of substructures and other structural properties from MS data. However, there are still many substructures that cannot be recognized efficiently by existing classifiers, so the search for better techniques for mass spectral pattern recognition remains an open task in chemometrics.

In this thesis, I propose a new approach that combines classification trees (CT) with sliced inverse regression (SIR) and apply it to the classification of mass spectra.
Classification trees have been used to generate classifiers from MS data because of their powerful ability in automatic variable selection and automatic interaction detection. However, they are often weak at representing linear and global relationships among variables. If the output depends on the inputs through some linear combination of the input variables, a classification tree cannot capture the effect of that linear combination, which leads to low accuracy. SIR is an effective method for finding useful linear combinations of predictor variables with which to regress the response. Merging CT and SIR can therefore inherit the advantages of both. Experiments show that the proposed approach can improve the classification accuracy of a decision tree and gives better results than other classical classification methods.

Boosting is one of the most important recent developments in classification methodology and has been applied successfully in many different fields, but it is almost unknown in chemometrics. In this thesis, we apply boosted neural networks and boosted trees to the classification of chemical data. Experimental results show that boosting can significantly improve the prediction performance of a single classification method. In one experiment, mass spectral data are used for the classification of 15 substructures, and we apply boosted trees to classify this group of MS data. The performance of boosting is very encouraging: compared with previous results, boosting significantly improves the accuracy of classifiers based on mass spectra. Two techniques for interpreting the model are also introduced to help us better understand the experimental results.

Finally, we propose a generalized boosting algorithm based on the Bayes optimal decision rule. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced.
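The reweight-and-vote procedure just described can be sketched as follows. This is a minimal illustration of standard AdaBoost with one-feature threshold rules (decision stumps) as the base classifier, not the generalized variant proposed in the thesis; the function names, the use of NumPy, and the choice of stumps are all assumptions made for the example.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, -s, s)
                err = w[pred != y].sum()  # weighted misclassification rate
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def stump_predict(X, j, t, s):
    """Predict -s where feature j is <= t, and s otherwise."""
    return np.where(X[:, j] <= t, -s, s)

def adaboost_fit(X, y, n_rounds=20):
    """Fit AdaBoost on labels y in {-1, +1}; returns a list of weighted stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start from uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this classifier
        pred = stump_predict(X, j, t, s)
        w = w * np.exp(-alpha * y * pred)      # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def adaboost_predict(X, ensemble):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

Calling `adaboost_fit(X, y)` on a feature matrix `X` and labels `y` in {-1, +1}, then `adaboost_predict(X_new, ensemble)`, gives the voted prediction. Each round's vote weight `alpha` depends only on the overall weighted error rate, so both kinds of misclassification are treated alike.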
Using the Bayes optimal decision rule, we adjust the weights given to the sequence of classifiers in the voting process of the boosting algorithm. The two types of classification error are introduced into the generalized boosting algorithm, which makes the voting process more sensible. The weights of the training samples are adjusted correspondingly, according to some criterion. The generalized boosting algorithm is applied to a two-class classification problem with chemical data. Experimental results show that it can improve prediction accuracy compared with the AdaBoost algorithm, especially when the difference between the two types of classification error is large.
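For the sliced inverse regression step mentioned in the abstract, a minimal sketch is given below. It assumes that each class label forms one slice, which is a common choice for classification problems; the abstract does not say how the thesis defines the slices or how the resulting directions are merged with the tree, so the function name `sir_directions` and every detail here are illustrative assumptions.

```python
import numpy as np

def sir_directions(X, y, n_dirs=1):
    """Estimate SIR directions, using each distinct class label in y as one slice."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Whiten the predictors: Z = (X - mean) @ cov^(-1/2)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Z = (X - mean) @ inv_sqrt
    # Weighted covariance of the slice means of the whitened predictors
    M = np.zeros((p, p))
    for c in np.unique(y):
        idx = (y == c)
        m = Z[idx].mean(axis=0)
        M += (idx.sum() / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale
    _, vecs = np.linalg.eigh(M)          # eigenvalues come in ascending order
    top = vecs[:, ::-1][:, :n_dirs]
    B = inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)  # unit-length columns
```

One plausible way to combine the two methods, again an assumption rather than the thesis's stated procedure, is to append the projection `X @ sir_directions(X, y)` as an extra column before growing a tree, so that a single split can act on a linear combination of the original variables.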
Keywords/Search Tags: Boosting, Chemometrics, Classification of mass spectra, Classification tree, Data mining, Sliced inverse regression