
Research On Protein-protein Interactions Extraction Methods Based On Biomedical Text Mining

Posted on:2018-08-18
Degree:Master
Type:Thesis
Country:China
Candidate:Z D Bao
GTID:2310330512486873
Subject:Software engineering
Abstract/Summary:
With the rapid growth of the biomedical literature in recent years, using data mining to obtain biomedical knowledge from that literature has become a hot research topic in bioinformatics. A basic and important way in which proteins achieve their functions is through interactions with other proteins (protein-protein interaction, PPI). However, a large amount of PPI information is recorded as unstructured text in the biomedical literature, and reviewing it manually is very time-consuming. Text mining is therefore a useful way to extract and analyze PPI relations from the literature accurately. Current research treats PPI extraction from biomedical literature as a binary classification problem, usually solved with statistical and machine learning algorithms: features are extracted from biomedical text to form feature vectors, a classification model is built on them, and good performance has been obtained. The approaches now used are mostly supervised methods, which need a large amount of labeled PPI data to build a classification model. To reduce the amount of labeled data needed to construct the classification model, this thesis studies the following two aspects:

1. PPI extraction based on distant supervision and transfer learning

The PPI dataset to be classified is regarded as the target domain dataset. A relation extraction model is constructed by using transfer learning to transfer knowledge from a source domain dataset with a different distribution, which reduces the number of labeled target domain PPI samples required. This study builds an auxiliary dataset by distant supervision and uses it as the source domain PPI set. First, PPI data were downloaded from the IntAct PPI database to serve as the relation knowledge base, and biomedical abstracts were downloaded from the PubMed database to serve as the raw corpus. The PPI pairs in the knowledge base were then mapped to the biomedical sentences in the corpus, and positive and negative samples were labeled according to whether or not the mapping exists. The instance-based transfer learning algorithm TrAdaBoost was used to build a classification model on the source data and part of the target PPI data. Results on three standard datasets demonstrate that the classification model built with TrAdaBoost on the auxiliary dataset constructed by distant supervision has better performance.

2. PPI extraction based on distant supervision and transfer learning in the PU scenario

In practical applications, only a small part of the PPI data is labeled. Owing to the restrictions of experimental conditions, for many protein pairs it is not known whether they interact, and these pairs can be regarded as unlabeled data; only a small number of pairs have been validated by experiment as interacting, and these can be regarded as positive samples. In this case, traditional supervised algorithms cannot build an effective classification model for the PPI data. This study proposes a PPI extraction algorithm based on transfer learning and distant supervision in the PU scenario. Feature information of the target PPI data is collected, and knowledge is transferred by assigning weights to the source PPI samples using data gravitation; the PU learning algorithm TPAODE is then constructed on the weighted source data using static classifier ensemble techniques. The experimental results show that the classification model built with the proposed TPAODE algorithm on unlabeled target PPI data and a few labeled source data performs better than traditional PU algorithms. To further reduce the labeled data required in the PPI extraction task, the auxiliary dataset constructed by distant supervision was used as the source data; the classifier built on the few labeled source data and the unlabeled target data performed better than the existing PU learning algorithms PNB and PTAN.
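
The distant supervision labeling step described in the abstract (mapping knowledge base PPI pairs onto corpus sentences) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function name `label_sentences`, the whole-word string matching of protein names, and the placeholder names ProtA, ProtB and ProtC are all hypothetical. The abstract does not say how protein mentions were recognized or how the IntAct and PubMed downloads were parsed.

```python
import re
from itertools import combinations


def label_sentences(sentences, known_pairs, protein_names):
    """Label co-occurring protein pairs by lookup in a knowledge base.

    Returns a list of (sentence, protein_a, protein_b, label) tuples,
    where label is 1 if the pair is in the knowledge base and 0 otherwise.
    """
    # Store pairs as unordered sets so (A, B) and (B, A) are equivalent.
    known = {frozenset(pair) for pair in known_pairs}
    samples = []
    for sentence in sentences:
        # Assumption: a protein is "mentioned" if its name appears as a whole word.
        mentioned = sorted(
            name for name in protein_names
            if re.search(r"\b" + re.escape(name) + r"\b", sentence)
        )
        # Every pair of proteins mentioned in the same sentence is a candidate.
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in known else 0
            samples.append((sentence, a, b, label))
    return samples


# Toy data with placeholder names (hypothetical, not taken from IntAct or PubMed).
toy_sentences = [
    "ProtA binds ProtB in vitro.",
    "ProtA and ProtC were both expressed.",
]
toy_knowledge_base = [("ProtB", "ProtA")]
toy_names = {"ProtA", "ProtB", "ProtC"}

samples = label_sentences(toy_sentences, toy_knowledge_base, toy_names)
for sample in samples:
    print(sample)
```

Labels produced this way are noisy, since a sentence can mention two interacting proteins without describing the interaction. That is consistent with the abstract's choice to treat the auxiliary set as a source domain with a different distribution rather than as clean target training data.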
Keywords/Search Tags:protein-protein interaction extraction, transfer learning, positive unlabeled learning, distant supervision