Research on Technologies in Protein-Protein Interaction Text Mining Based on Discriminative Models

Posted on: 2012-11-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Z Qian
GTID: 1480303359958909
Subject: Computer software and theory
Abstract/Summary:
Protein-protein interactions describe the interaction relationships between proteins and are valuable to biomedicine in both theory and application. As biomedicine develops, manually collecting protein-protein interaction information from the literature cannot keep pace with the rapid growth of biomedical publications. Text mining can automatically discover knowledge from text, so it is widely applied to the task of extracting protein-protein interactions. However, traditional approaches do not meet practical requirements for protein identification and interaction extraction performance. In addition, reliance on labeled corpora limits the performance of the algorithms. To address these problems, this dissertation builds on discriminative models in machine learning and covers two tasks: protein named entity identification and protein-protein interaction information extraction. The four main innovative results are as follows:

1. Based on the conditional random fields model, a protein named entity identification method combining feature selection and post-processing is proposed. The traditional word feature approach is extended on top of the extracted features. The new modules are feature selection based on information gain and, in post-processing, boundary tuning rules based on part of speech together with a word filtering method. The experimental results show that this method adapts better than traditional approaches to named entity identification tasks with complicated entity definitions.

2. A protein-protein interaction information extraction model based on model ensembling is proposed. Following the idea of cascade generalization, the results of pattern matching are taken as features and incorporated into the bag-of-words method, so that the model combines the advantages of pattern learning and the bag-of-words method. In pattern learning, the single-pattern evaluation method is improved, and a pattern evaluation method based on performance gain is proposed to drop poor or redundant patterns effectively. The experimental results show that, compared with the individual approaches, this method markedly improves classification performance and gives more balanced precision and recall.

3. A protein-protein interaction information extraction method that incorporates shallow parsing is proposed. Complicated grammatical structures in biomedical literature result in low information extraction performance. Before information extraction, input sentences are processed by chunk parsing, appositive parsing, coordination parsing and clause analysis, so that candidate protein pair instances are divided into different grammatical units. This division limits the search scope for protein pairs and improves classification precision. Experimental results show that, compared with traditional machine learning based approaches, this method improves the F1 value by more than 10%.

4. Bag-of-words and automatic pattern learning approaches are applied within co-training, and an automatic instance labeling approach based on the k-nearest neighbor algorithm (kNN) is proposed. When labeled samples are scarce, the co-training framework lets the bag-of-words and pattern learning approaches learn from and complement each other. In the kNN-based approach, the sequence alignment scores between the protein pair texts of individual samples are defined as distance values, and raw instances are labeled automatically. Experimental results show that when few labeled samples are available, both approaches make effective use of unlabeled samples and markedly improve information extraction performance.
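The kNN-based labeling idea in the fourth result can be illustrated with a minimal sketch. It is not the dissertation's implementation. The abstract does not specify the alignment algorithm, the scoring parameters, or the preprocessing, so the sketch assumes a global (Needleman-Wunsch style) alignment over word tokens, match/mismatch/gap scores of +1/-1/-1, protein names masked as the hypothetical placeholders PROT1 and PROT2, and a higher alignment score meaning a closer neighbor. The function names `alignment_score` and `knn_label` are illustrative only.

```python
from collections import Counter


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two token lists (assumed scoring scheme)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def knn_label(tokens, labeled, k=3):
    """Label `tokens` by majority vote among the k best-aligned labeled samples."""
    ranked = sorted(
        labeled,
        key=lambda item: alignment_score(tokens, item[0]),
        reverse=True,  # higher alignment score is treated as "nearer"
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled set (invented for illustration): 1 = interaction, 0 = no interaction.
labeled = [
    ("PROT1 interacts with PROT2".split(), 1),
    ("PROT1 binds to PROT2".split(), 1),
    ("PROT1 directly binds PROT2".split(), 1),
    ("PROT1 and PROT2 were measured".split(), 0),
    ("levels of PROT1 and PROT2".split(), 0),
]

query = "PROT1 strongly interacts with PROT2".split()
predicted = knn_label(query, labeled, k=3)
```

In this toy setup an unlabeled sentence receives the majority label of the labeled sentences it aligns with best. A real system would need a scoring scheme and a value of k tuned on held-out data.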
Keywords/Search Tags: discriminative model, protein-protein interaction, information extraction, pattern learning, semi-supervised learning