
Research Of Protein-Protein Interaction Extraction Based On Rich Feature And Multiple Kernels Learning

Posted on: 2012-10-14
Degree: Master
Type: Thesis
Country: China
Candidate: M H Ji
GTID: 2210330368488752
Subject: Computer system architecture
Abstract/Summary:
As the volume of biomedical literature grows rapidly, it is difficult for biomedical researchers to find useful information by reading large numbers of papers. Automated information extraction from biomedical literature has therefore become a task of significant interest in the biomedical natural language processing (BioNLP) field, since it can help researchers find useful information in large collections accurately and efficiently. Automatic extraction of protein-protein interactions (PPI) from biomedical literature is a subtask of biomedical text mining that can help build protein knowledge networks, predict new interactions, discover new drugs, and so on.
In recent years, statistical machine learning techniques have been used to extract PPI. Feature representation is one of the most important issues in statistical machine learning and has a large impact on system performance. This paper focuses on feature representation for PPI extraction and uses multiple kernel learning to combine different types of effective features.
We first introduce the research background and related knowledge of PPI extraction, and then introduce support vector machines, syntactic parsing, and evaluation metrics. Second, we present a rich-feature method for PPI extraction. The method extracts not only bag-of-words features but also N-gram, position, special position, and word distance features from the context, along with two syntactic features: sentence distance and predicate-argument structure. These features are evaluated on five publicly available PPI corpora, and the results show that they help improve PPI extraction performance. Finally, we present an approach that combines useful features extracted from a sentence and its dependency graph by using a combination of kernels.
Kernel-based methods offer appealing solutions for learning from rich structural data that cannot be easily expressed as flat features. We first define three kernels, namely a feature-based kernel, a graph kernel, and a walk-weighted subsequence kernel, which extract useful features from a sentence and its dependency graph. Multiple kernel learning is then used to combine the three appropriately weighted kernels for PPI extraction. Combining kernels captures a wider range of useful information and reduces the risk of missing important information, and the approach obtains an F-score of 62.4 and an AUC of 87.2 on the AIMed corpus, which is comparable with state-of-the-art approaches.
In summary, this paper uses feature-based and kernel-based statistical machine learning methods to explore effective features from the sentence context and the syntactic structure for PPI extraction. The methods are evaluated on five publicly available PPI corpora, and the results show that they generalize well and achieve competitive results on all five corpora.
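The idea of combining weighted kernels can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the graph kernel and the walk-weighted subsequence kernel are not reproduced here, and the two toy kernels, the instance layout (`tokens`, `p1`, `p2`), the `gamma` value, and the fixed weights are all assumptions chosen for illustration. It only shows how context features such as bag-of-words, N-grams, and word distance might feed separate kernels that are then summed with non-negative weights.

```python
from collections import Counter
from math import exp, sqrt


def ngram_features(tokens, n_max=2):
    """Count unigrams (bag-of-words) and contiguous n-grams up to n_max."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def cosine_kernel(a, b):
    """Normalized dot product of two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def distance_kernel(d1, d2, gamma=0.5):
    """Laplacian kernel on the token distance between the two protein mentions."""
    return exp(-gamma * abs(d1 - d2))


def combined_kernel(x, y, weights=(0.7, 0.3)):
    """Weighted sum of two kernels (hypothetical weights).

    A non-negative weighted sum of valid kernels is itself a valid kernel.
    """
    k_feat = cosine_kernel(ngram_features(x["tokens"]), ngram_features(y["tokens"]))
    k_dist = distance_kernel(abs(x["p2"] - x["p1"]), abs(y["p2"] - y["p1"]))
    return weights[0] * k_feat + weights[1] * k_dist


# Hypothetical instances: a tokenized sentence plus the two protein positions.
x = {"tokens": ["PROT1", "binds", "to", "PROT2"], "p1": 0, "p2": 3}
y = {"tokens": ["PROT1", "interacts", "with", "PROT2"], "p1": 0, "p2": 3}

# Gram matrix over the toy data set.
data = [x, y]
gram = [[combined_kernel(a, b) for b in data] for a in data]
print(gram)
```

A Gram matrix built this way could be passed to an SVM that accepts precomputed kernels. In a multiple kernel learning setting the weights would be learned or tuned on held-out data instead of being fixed by hand as they are here.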
Keywords/Search Tags:Protein-Protein Interaction Extraction, Natural Language Processing, Machine Learning, Feature, Kernels