Research On Key Techniques Of Protein-protein Interaction Extraction

Posted on: 2013-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Sun
Full Text: PDF
GTID: 2298330392467952
Subject: Computer Science and Technology
Abstract/Summary:
The amount of biomedical literature is increasing dramatically as research in the biomedical field develops. Protein-protein interaction extraction is an important way to mine knowledge from biomedical text. This thesis studies the key techniques of protein-protein interaction extraction, namely biomedical named entity recognition and relation extraction, where named entity recognition is the basis of relation extraction. To address these two problems, this thesis studies three main aspects and builds a biomedical literature retrieval system oriented toward protein-protein interaction.

First, this thesis uses both a generative model and a discriminative model for biomedical named entity recognition. To capture long-distance dependencies and the power-law characteristic of natural language, it proposes a generative model based on the Sequence Memoizer, a nonparametric Bayesian model. Experimental results on the JNLPBA 2004 dataset show that the proposed model is more effective than an HMM and is comparable to a maximum entropy model. To make full use of a rich feature set and large-scale training data, this thesis also applies a maximum entropy model to biomedical named entity recognition. The advantages of the maximum entropy model are that it can incorporate effective features, its training time is short, and it is applicable to large-scale data sets. The result in the CALBC 2011 biomedical named entity recognition challenge indicates that this method is effective when faced with large-scale, low-quality training data.

Second, this thesis proposes a protein-protein interaction extraction method based on automatic rule learning, in which the rules are generated with a dependency parser. The method automatically learns rules from dependency parsing results and builds a rule library; prediction is then a rule matching process. The results on the AIMed corpus verify the validity of the proposed method.

Third, to take advantage of large-scale unlabeled data for protein-protein interaction extraction, this thesis proposes a semi-supervised method based on generalized expectation criteria. The method uses a maximum entropy model trained with generalized expectation criteria. Experiments on the AIMed corpus verify that this approach can effectively use a small number of labeled samples together with large-scale unlabeled samples, which makes it well suited to the biomedical field, where annotated data is scarce.

Finally, this thesis builds a biomedical literature retrieval system oriented toward the MEDLINE database. In addition to the regular search function, the system integrates biomedical named entity recognition and protein-protein interaction extraction. The retrieval system has a certain practical value.
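
The rule-learning idea in the second contribution (learn rules from dependency parses, store them in a rule library, predict by rule matching) can be illustrated with a minimal sketch. The abstract does not specify what a rule looks like, so everything here is an assumption made for illustration: a rule is taken to be the shortest dependency path between two protein mentions with the proteins replaced by a `PROT` placeholder, the parses are hand-written `(head, relation, dependent)` triples rather than real parser output, and the names `shortest_path`, `path_pattern`, `learn_rules`, and `predict` are hypothetical.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search over the parse, treated as an undirected graph.

    Returns the list of tokens from source to target, or None if unreachable.
    Assumes each token string is unique within the sentence.
    """
    graph = {}
    for head, _relation, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def path_pattern(edges, prot_a, prot_b):
    """Abstract the path between two proteins into a pattern (assumed rule form)."""
    path = shortest_path(edges, prot_a, prot_b)
    if path is None:
        return None
    return tuple("PROT" if tok in (prot_a, prot_b) else tok.lower() for tok in path)


def learn_rules(examples):
    """Build the rule library from the patterns of positive training pairs."""
    rules = set()
    for edges, prot_a, prot_b, interacts in examples:
        pattern = path_pattern(edges, prot_a, prot_b)
        if interacts and pattern is not None:
            rules.add(pattern)
    return rules


def predict(rules, edges, prot_a, prot_b):
    """Prediction is rule matching: does the pair's pattern occur in the library?"""
    return path_pattern(edges, prot_a, prot_b) in rules


# Toy training data (hand-written parses, not taken from AIMed).
TRAIN = [
    # "RAD51 binds BRCA2" -> interacting pair
    ([("binds", "nsubj", "RAD51"), ("binds", "dobj", "BRCA2")], "RAD51", "BRCA2", True),
    # "RAD51 and BRCA2 were measured" -> non-interacting pair
    ([("measured", "nsubjpass", "RAD51"), ("RAD51", "conj", "BRCA2")], "RAD51", "BRCA2", False),
]
RULES = learn_rules(TRAIN)

# Unseen sentence "MDM2 binds p53" matches the learned pattern.
NEW_EDGES = [("binds", "nsubj", "MDM2"), ("binds", "dobj", "p53")]
print(predict(RULES, NEW_EDGES, "MDM2", "p53"))
```

Exact matching of whole paths is the simplest possible matching policy; a practical rule library would likely need some generalization over words or relations, which the abstract does not describe.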
Keywords/Search Tags: biomedical named entity recognition, protein-protein interaction extraction, Sequence Memoizer, dependency parsing, generalized expectation criteria