Outlier Detection, Information Mining and Modeling in QSPR/QSAR Studies of Molecular Structure

Posted on: 2010-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: D S Cao
Full Text: PDF
GTID: 2208360278969813
Subject: Analytical Chemistry
Abstract/Summary:
Chemoinformatics arose to meet scientists' growing need to extract chemical knowledge from large-scale data sets. It applies informatics methods to solve chemical problems, and one of its central aims is to obtain expert knowledge that explains observed phenomena. This knowledge is usually hidden in very large data sets, and new ideas and methods are needed to mine it. This thesis is organized as follows:

1: Molecular structure information is encoded by molecular topological indices, which reflect the physical and chemical properties of different molecules. Projection pursuit is used to depict this structural information visually. Projection pursuit is the numerical optimization of a criterion in search of the most interesting low-dimensional linear projection of a high-dimensional data cloud; by carrying out the computations in a lower-dimensional subspace, the technique bypasses many of the problems of high dimensionality. The results show that the four topological indices reflect different structural information, with partial overlap. A subspace comparison method for block variables is used to quantify the amount of overlapping information.

2: Four models were built to predict the viscosity of organic compounds from molecular topological indices: partial least squares, principal component regression, a radial basis function network and support vector regression. All four models capture the quantitative relationship between viscosity and the topological indices accurately; support vector regression performs better and gives a smaller prediction error.

3: Aqueous solubility of drug compounds plays a very important role in drug research and development. In this study, three chemometric methods, namely partial least squares (PLS), support vector regression (SVR) and a back-propagation network (BPN), were used to model the quantitative structure-property relationship (QSPR) for the aqueous solubility of druglike compounds. Molecular descriptors of all drug compounds were calculated with the Dragon software, and 33 molecular descriptors were used to model aqueous solubility. All three models predict drug solubility well. For a data set of 225 druglike compounds, SVR was found to be superior to PLS and BPN. The best SVR model had an overall R2 of 0.851 and a root mean square error (RMSEF) of 0.542 for the training set, and a Q2 of 0.810 and a root mean square error (RMSEP) of 0.611 for the validation set. The predictions are in good agreement with the experimental values.

4: A crucial step in building a high-performance QSAR/QSPR model is the detection of outliers. Detecting outliers in a multivariate point cloud is not trivial, especially when several outliers coexist. Classical identification methods do not always find them, because they rely on the sample mean and covariance matrix, which are themselves influenced by the outliers. Moreover, existing methods emphasize particular types of outliers rather than all of them. To avoid these problems and detect all kinds of outliers simultaneously, a new strategy based on Monte Carlo cross-validation, termed the MC method, is proposed. The MC method inherently provides a feasible way to detect different kinds of outliers by building many cross-predictive models. The distribution of the predictive residuals obtained in this way appears to reduce the risk caused by the masking effect. In addition, a new display is proposed in which the absolute value of the mean of the predictive residuals is plotted against their standard deviation. The plot divides the data into normal samples, y-direction outliers and X-direction outliers. Several examples demonstrate the detection ability of the MC method in comparison with other diagnostic methods.
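
The idea behind the MC method can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the thesis: a single descriptor with an ordinary least squares line stands in for the actual regression model, the function names (`fit_line`, `mc_residual_stats`), the number of runs, the 70% training fraction and the toy data are all hypothetical. The sketch repeatedly splits the data at random, fits on the training part, collects the predictive residuals of the held-out samples, and returns for each sample the two coordinates used in the proposed display: the absolute mean and the standard deviation of its residuals.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x (a stand-in model, not the thesis's)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mc_residual_stats(xs, ys, n_runs=500, train_fraction=0.7, seed=0):
    """Monte Carlo cross-validation.

    Returns one (|mean residual|, std of residuals) pair per sample,
    computed only from runs in which that sample was held out.
    """
    rng = random.Random(seed)
    n = len(xs)
    n_train = int(round(train_fraction * n))
    residuals = [[] for _ in range(n)]
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)  # random split into training and validation parts
        train, test = idx[:n_train], idx[n_train:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            residuals[i].append(ys[i] - (a + b * xs[i]))
    stats = []
    for r in residuals:
        if len(r) >= 2:
            stats.append((abs(statistics.fmean(r)), statistics.stdev(r)))
        else:  # sample was (almost) never held out; no usable distribution
            stats.append((float("nan"), float("nan")))
    return stats


# Toy data (hypothetical): y = 2 + 0.5*x with small alternating noise.
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
ys[5] += 8.0  # planted y-direction outlier at sample 5

stats = mc_residual_stats(xs, ys)
suspect = max(range(len(stats)), key=lambda i: stats[i][0])
print("largest |mean residual| at sample", suspect)
```

In this toy case only a y-direction outlier is planted, and it stands out through its large absolute mean residual. The abstract does not state how the boundaries between normal samples, y-direction outliers and X-direction outliers are drawn in the plot, so no thresholds are shown here.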
Keywords/Search Tags: QSAR/QSPR, SVM, BPN, PLS, MC, Projection Pursuit