
Multi-label Prediction Model Based On Ontology Database And Data Mining In Bio-medicine

Posted on: 2018-07-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Cheng
GTID: 1318330566952304
Subject: Control Science and Engineering
Abstract/Summary:
As information, concepts, and terminology grow at an exponential rate, ontology databases are becoming more and more popular. An ontology database unifies scientific terminology and connects old and new knowledge seamlessly. GO and ChEBI are two typical ontology databases. The GO database is a large database describing the biological functions of gene products. The GO project was initiated by biologists who wanted a unified description of gene products across different fields, so that new gene products could be annotated more easily and quickly. ChEBI is a freely downloadable molecular dictionary dedicated to small chemical entities. Molecules encoded by the genome, such as nucleic acids, proteins, and peptides derived from proteins by cleavage, are not included in ChEBI, because such data are abundant in other databases. Extracting information from ontology databases when building prediction models for biological information helps greatly to improve model performance. Previous prediction models relied on methods such as classification, graph-structure similarity, and information-content similarity. However, with the rapid growth of data and the development of artificial intelligence, these older models can no longer meet researchers' requirements. Based on these two ontology databases, this dissertation uses a Bayesian probabilistic method to construct new protein feature vectors and uses multi-label machine learning algorithms to predict protein subcellular location and the ATC classification of drugs. The validity of the proposed prediction models is demonstrated by jackknife cross-validation and by comparison with the latest prediction models. The main work and innovations of this dissertation are as follows:

(1) A new multi-label algorithm, ML-GKR. Multi-label classification is another kind of problem in machine learning. It can be viewed as a variant of the multi-class classification problem. One or more target labels may be
assigned to each instance in multi-label classification. ML-GKR evolved from the Gaussian kernel regression algorithm with the addition of multi-label components. In terms of computational complexity and memory usage, ML-GKR outperforms the ML-KNN and Rank-SVM algorithms. In the study of the multi-label ATC drug classification model, comparing ML-GKR with the classical multi-label algorithms ML-KNN and Rank-SVM showed that ML-GKR is the best on four of the five performance metrics, the exception being Aiming. In particular, on Absolute True, the most important measure for multi-label algorithms, ML-GKR reached 60.98%, which is 6.82% and 25.21% higher than ML-KNN and Rank-SVM, respectively. In the jackknife test, the running time of ML-GKR was only 0.3% of that of ML-KNN and 0.3% of that of Rank-SVM. On an x64 computer with a 4 × 2.6 GHz CPU, 4 GB of memory, and Windows 7, the jackknife test of ML-GKR ran in 3 minutes.

(2) Multi-label ATC drug classification model. Given a drug compound, we extract compound-compound interaction information, compound-compound structural similarity information, and molecular fingerprint similarity information as the sample features. The novel multi-label algorithm ML-GKR is applied to predict the ATC class of the drug, along with its active ingredients, therapeutic use, and chemical properties. An advantage of this model is that its web site is the first to implement multi-label ATC drug classification, which is more consistent with how drugs are actually classified under ATC. Previous ATC classification sites are single-label, meaning that a drug can be labeled with only one of the fourteen ATC classes. However, a drug may belong to multiple ATC classes at the same time. Through the study of our ATC classification model iATC-mISF, we found that the model is important for the reuse of existing drugs. We applied the online prediction model to 3883 drugs, and 1229 of the drug samples were predicted to be false
positives. Based on an analysis of the false positive results for some drugs, we found that false positives may be helpful for drug reuse and redevelopment.

(3) Multi-label ATC drug classification model based on the ChEBI database. We extract the feature vector of a drug from the information in the ChEBI database, and a new model called OIPM was developed. Differences between the textual descriptions in ChEBI are used to quantify the similarity between two compounds. OIPM is combined with compound-compound interaction information, compound-compound structural similarity information, and molecular fingerprint similarity information to further improve the prediction accuracy of the drug ATC class. On this basis, a multi-label ATC drug prediction model, iATC-mHyb, is developed. From the textual description of a compound's characteristics and functions, its therapeutic, pharmacological, and chemical properties are predicted. This is an important and challenging problem, because such predictions help in the development and use of novel drugs.

(4) Multi-label animal protein subcellular location prediction model based on GO. The model pLoc-Animal builds on the latest animal subcellular location prediction model, iLoc-Animal. Its web site address is http://www.jci-bioinfo.cn/pLoc-mAnimal/. The GO vector of animal proteins in iLoc-Animal is based on the GO term frequency method. This method counts how often each GO term annotates the protein and its homologous proteins, and uses these frequencies as the protein feature vector. The protein feature vector of the iLoc-Animal model has 3043 dimensions. The model pLoc-Animal instead reduces the dimensionality of the feature vector using Bayesian statistics. First, following the naive Bayes classifier, variables are assumed to be independent, and for each GO term its correlation with each subcellular location is calculated from Bayesian probability statistics. Finally, the maximum correlation
value is taken as the feature vector of the protein. This Bayesian dimensionality reduction method reduces the GO feature vectors used in iLoc-Animal to protein feature vectors of only 20 dimensions. The multi-label classification algorithm ML-GKR is used. By combining the GO feature vectors with the Grey-PSSM feature vectors of the new dataset, the pLoc-Animal model is constructed to predict the subcellular location of animal proteins. The absolute accuracy of the pLoc-Animal prediction model is 0.6193. Compared with iLoc-Animal, this is 0.16 higher in absolute terms, a relative improvement of 35%. Absolute accuracy is one of the most important metrics for multi-label algorithms. The running time of the prediction model is also greatly reduced because the feature dimension of the data is reduced. On an x64 computer with a 4 × 2.6 GHz CPU, 4 GB of memory, and Windows 7, jackknife testing of iLoc-Animal takes more than a month. On the same hardware platform and operating system, the jackknife test of pLoc-Animal takes only 2 minutes. Because the running time is so much shorter, it also becomes feasible to optimize the ML-GKR parameters.

(5) GO-based model for predicting the subcellular localization of multi-label plant proteins. A multi-label plant subcellular localization prediction model, pLoc-mPlant, was created, and its web server is available at http://www.jci-bioinfo.cn/pLoc-mPlant/. The number of proteins differs greatly between the classes of the plant protein dataset. Only 21 of the 978 proteins belong to the Golgi apparatus and peroxisome classes, whereas 286 belong to the chloroplast class. The ratio of positive to negative samples for the Golgi apparatus and peroxisome classes is close to 1:50. To address the imbalance of the plant protein dataset, two new multi-label algorithms, EML-GKR-1 and
EML-GKR-2, are constructed. EML-GKR-1 uses imbalance-aware sampling, and EML-GKR-2 uses a cost-sensitive method. The prediction model represents each protein by its GO feature vector. Three multi-label algorithms (ML-GKR, EML-GKR-1, and EML-GKR-2) are used to predict the subcellular location of plant proteins. Compared with the latest plant protein subcellular location prediction model, the new ensemble algorithms perform better.
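
The abstract names ML-GKR, the jackknife test, and the Aiming and Absolute True metrics without giving formulas. The following is only a minimal sketch of how these pieces could fit together, not the dissertation's implementation. It assumes a Nadaraya-Watson style Gaussian kernel regression over label vectors, a fixed score threshold, and the usual set-based definitions of the two metrics. All function names, the `sigma` and `threshold` parameters, and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian (RBF) similarity between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def predict_labels(train_x, train_y, query, sigma=1.0, threshold=0.5):
    """Kernel-weighted average of training label vectors, thresholded per label.

    Assumption: if no label reaches the threshold, the top-scoring label is
    returned so that every sample receives at least one label.
    """
    weights = [gaussian_kernel(query, x, sigma) for x in train_x]
    total = sum(weights)
    n_labels = len(train_y[0])
    if total == 0.0:
        scores = [0.0] * n_labels
    else:
        scores = [
            sum(w * y[j] for w, y in zip(weights, train_y)) / total
            for j in range(n_labels)
        ]
    predicted = {j for j, s in enumerate(scores) if s >= threshold}
    if not predicted:
        predicted = {max(range(n_labels), key=lambda j: scores[j])}
    return predicted


def jackknife_predictions(xs, ys, sigma=1.0, threshold=0.5):
    """Leave-one-out test: predict each sample from all the other samples."""
    preds = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        preds.append(predict_labels(rest_x, rest_y, xs[i], sigma, threshold))
    return preds


def aiming(true_sets, pred_sets):
    """Average fraction of predicted labels that are correct."""
    return sum(len(t & p) / len(p) for t, p in zip(true_sets, pred_sets)) / len(true_sets)


def absolute_true(true_sets, pred_sets):
    """Fraction of samples whose predicted label set matches exactly."""
    return sum(1 for t, p in zip(true_sets, pred_sets) if t == p) / len(true_sets)


# Toy data (hypothetical): two well-separated clusters, three labels.
xs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
ys = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
true_sets = [{j for j, v in enumerate(y) if v} for y in ys]
preds = jackknife_predictions(xs, ys)
print(aiming(true_sets, preds), absolute_true(true_sets, preds))
```

Because the jackknife test repeats the prediction once per sample, its cost grows with both the sample count and the feature dimension, which is consistent with the abstract's point that reducing the feature vectors shortens the running time.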
Keywords/Search Tags: bioinformatics, imbalance multi-label classification, ensemble multi-label classification, subcellular localization, multi-label kernel regression model, drug ATC classification, Bayesian statistics dimensionality reduction