Feature Selection and Sample Selection Applied to Cancer Classification and Structure-Activity Relationship Study of Drugs

Posted on: 2015-01-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Dai
Full Text: PDF
GTID: 1220330482470431
Subject: Agricultural Entomology and Pest Control
Abstract/Summary:
When modeling large-scale data, high-dimensional feature selection and neighboring-sample selection can greatly improve model performance and substantially reduce modeling time, making them necessary and effective steps in building a classification or regression model. This study optimized models from several perspectives, including feature extraction and selection, learning machines, and neighboring-sample selection, and applied them to cancer microarray data analysis and to Quantitative Structure-Activity Relationship (QSAR) studies of drugs.
First, traditional F-test and top scoring pairs family algorithms consider only one-way comparisons and ignore interactions between genes. To overcome this defect, we compared multiple genes from two directions based on unbalanced two-way analysis of variance, jointly considering the interactions between genes and phenotypes, and then obtained informative genes by integrated weighted ranking and elimination of redundancy. Drawing on the idea of transductive inference, we constructed an immediate classifier that requires no training process. Multi-angle comparisons on 10 multi-class cancer expression datasets showed that: 1) the new method achieved the best average prediction accuracy (92.06%) among the reference models, using quite a few informative genes; 2) it was superior to the top scoring series and correlation-based gene selection algorithms; 3) the performance of our classifier was comparable to Support Vector Classification and better than Linear Logistic Regression and Naive Bayes.
For the Leukl and Breast datasets, we performed multiple gene selection and analyzed biological pathways using Gene Ontology; several important biological processes were identified, and the functions and regulation patterns of critical genes were also analyzed.
Second, because analysis of variance cannot be used for feature selection on regression data, the Binary Matrix Shuffling Filter (BMSF) was applied to the QSAR study of antitumor drugs in the RPMI8402 and P388 cell lines. Specifically, 2923 high-dimensional molecular descriptors were extracted with the quantum chemistry calculation software PCLIENT, BMSF was used to select features, and Support Vector Regression (SVR) was then used for modeling and prediction. The results showed that: the SVR model with the reference descriptors was superior to the Multiple Linear Regression, Stepwise Linear Regression and Partial Least Squares regression models, and comparable to the Artificial Neural Network; for the high-dimensional descriptor space, 11 features were retained after feature selection, and SVR models with the retained descriptors were better than the reference models; the non-linear regression of SVR was highly significant, the importance of most retained descriptors reached significant levels, and the analysis of their effects on drug activity provided ideas for designing highly active antitumor drugs.
Further, considering feature selection and sample selection simultaneously, BMSF and the geostatistical semivariogram were applied to QSAR analysis of Angiotensin Converting Enzyme inhibitors and HLA-A*0201 binding peptides.
Specifically, 531 physicochemical properties of amino acids were used as descriptors to characterize the primary structure of the peptides; BMSF was used to select features; a common range was determined by geostatistics from the retained, weighted descriptors; for each sample to be predicted, the neighboring samples were selected from the training set as those whose distance to that sample was less than the common range; the QSAR model was built using SVR with the weighted, selected features and this sample-specific set of neighboring training samples; and a prediction was made for each test sample accordingly. The results showed that the 1593 and 4779 high-dimensional descriptors were sharply reduced to averages of 15.4 and 15.8, respectively, after feature selection, and that the external predictive indicator Q2pred in the two applications was 0.982 and 0.806, respectively, superior to the references and to models using a single selection method. The distribution of, and preference for, multiple descriptor subsets at different residue positions were analyzed, which can provide theoretical guidance for designing highly active drug molecules.
The methods proposed in this thesis have broad application prospects in biomarker screening, pattern classification and molecular activity prediction.
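The neighboring-sample selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `select_neighbors` and `predict_with_neighbors`, the Euclidean distance, the default RBF-kernel `SVR` from scikit-learn, and the `min_neighbors` fallback are all assumptions made for illustration. The descriptor `weights` and the `common_range` are taken as given inputs here; in the thesis they come from BMSF and the semivariogram analysis, neither of which is reproduced.

```python
import numpy as np
from sklearn.svm import SVR


def select_neighbors(Xw_train, xw, common_range, min_neighbors):
    """Return indices of training samples closer to xw than common_range."""
    dist = np.linalg.norm(Xw_train - xw, axis=1)  # Euclidean distance (assumption)
    idx = np.flatnonzero(dist < common_range)
    if idx.size < min_neighbors:
        # Fallback (assumption, not from the abstract): take the nearest samples
        # so that the local model always has something to fit.
        idx = np.argsort(dist)[:min_neighbors]
    return idx


def predict_with_neighbors(X_train, y_train, X_test, weights, common_range,
                           min_neighbors=5):
    """Fit one SVR per test sample on its own set of neighboring training samples."""
    Xw_train = X_train * weights  # apply the descriptor weights column-wise
    Xw_test = X_test * weights
    preds = np.empty(len(Xw_test))
    for i, xw in enumerate(Xw_test):
        idx = select_neighbors(Xw_train, xw, common_range, min_neighbors)
        model = SVR(kernel="rbf").fit(Xw_train[idx], y_train[idx])
        preds[i] = model.predict(xw.reshape(1, -1))[0]
    return preds
```

Each test sample gets its own local model, so the cost grows with the number of test samples; the trade-off is that every prediction is based only on training samples lying within the common range of that test sample.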
Keywords/Search Tags: High-dimensional feature selection, Neighboring sample selection, Cancer classification, Drug molecules, Quantitative Structure-Activity Relationship