Research On Tree Kernel-based Protein-Protein Interaction Extraction

Posted on: 2015-01-14
Degree: Master
Type: Thesis
Country: China
Candidate: N Liu
Full Text: PDF
GTID: 2250330428967671
Subject: Computer system architecture
Abstract/Summary:
In recent years, biomedical technology has advanced rapidly, and research results and academic reports have appeared in ever greater numbers. Although the internet makes information easy to find, most of this knowledge remains buried in the vast biomedical literature. Researchers cannot obtain useful information quickly and effectively by reading these documents manually, which has motivated biological text mining. Within this field, protein-protein interaction extraction receives the most attention. Proteins are the indispensable material basis of all life activity, so understanding how they interact helps to explain the molecular mechanisms of life systematically and supports the treatment of diseases and the development of new drugs.
Early research on protein-protein interaction extraction relied mainly on rule-based methods. These methods are time-consuming to build, their performance depends heavily on the quality of the rules, and they are hard to port. Machine learning methods are now widely applied and can be divided into feature vector methods and kernel methods. Feature vector methods require a complex process of constructing and mapping feature vectors, so kernel-based methods are the current mainstream. However, most existing kernel functions are based on dependency information, and few studies have used kernels over constituent parse trees to extract protein-protein interactions. Yet the rich syntactic and structural features inherent in constituent parse trees are important for protein-protein interaction extraction.
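To make the idea of a kernel over constituent parse trees concrete, the following is a minimal sketch of a subset-tree kernel in the style of Collins and Duffy, which scores two trees by counting the tree fragments they share without building explicit feature vectors. The abstract does not say which tree kernel the thesis uses, so this choice is an assumption made for illustration. The nested-tuple tree encoding, the function names, the decay parameter `lam`, and the example sentences are likewise hypothetical.

```python
def internal_nodes(tree):
    """Yield every internal node of a tree written as nested tuples.

    A tree is (label, child, child, ...); a child is a tuple or a word (str).
    """
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)


def production(node):
    """Return the grammar production at a node, keeping words and labels apart."""
    rhs = tuple(
        ("label", c[0]) if isinstance(c, tuple) else ("word", c)
        for c in node[1:]
    )
    return (node[0], rhs)


def delta(n1, n2, lam):
    """Weighted count of the common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        # Equal productions guarantee c2 is a subtree whenever c1 is.
        if isinstance(c1, tuple):
            score *= 1.0 + delta(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum delta over all pairs of internal nodes of the two trees."""
    return sum(
        delta(a, b, lam)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )


# Hypothetical example trees for "PROT1 binds PROT2" and "PROT1 inhibits PROT2".
binds = ("S", ("NP", ("NN", "PROT1")),
         ("VP", ("VBZ", "binds"), ("NP", ("NN", "PROT2"))))
inhibits = ("S", ("NP", ("NN", "PROT1")),
            ("VP", ("VBZ", "inhibits"), ("NP", ("NN", "PROT2"))))

print(tree_kernel(binds, binds), tree_kernel(binds, inhibits))
```

A tree scores higher against itself than against a tree that differs in one word, which is the similarity signal a kernel-based classifier such as an SVM would consume.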
The shortest dependency path-directed constituent parse tree is one of the few tree kernel-based algorithms. It uses the shortest dependency path between the two proteins in a sentence to direct the pruning of the constituent parse tree, but the resulting syntax tree is still not concise, because the shortest dependency path carries noise introduced by the appositive dependency relation. This noise not only makes the syntax tree representation more complex but also hampers the classifier's judgment. To resolve this problem, this thesis proposes an effectively optimizational path-directed constituent parse tree. Processing rules are defined to remove appositives and eliminate noise that does not help protein-protein interaction extraction, and the optimized path is then used to prune the constituent parse tree. Experimental results show that the effectively optimizational path-directed constituent parse tree algorithm improves protein-protein interaction extraction performance.
An analysis of the instances misclassified by the shortest dependency path-directed constituent parse tree algorithm on five commonly used corpora shows that interaction verbs following modal verb phrases are easily left out of the shortest dependency path. As a result, the generated syntax tree fails to express the protein relationship instance completely. To resolve this problem, an effectively optimizational and expanding path-directed constituent parse tree is put forward, building on the effectively optimizational path-directed constituent parse tree algorithm. Processing rules are defined to add the missing interaction verbs to the shortest dependency path, and they are combined with the appositive handling of the effectively optimizational path-directed constituent parse tree algorithm, so that the constituent parse tree directed by the effectively optimizational and expanding path is both complete and concise. Experimental results show that the effectively optimizational and expanding path-directed constituent parse tree algorithm further improves the performance of protein-protein interaction extraction.
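The two path adjustments described above can be sketched as follows. The abstract does not give the actual processing rules, so both rules here are hypothetical stand-ins: one drops the non-protein member of an appositive pair on the path, and the other appends a verb that carries a modal auxiliary and is directly attached to a path token. The edge format `(head, relation, dependent)`, the relation names `appos` and `aux`, the `MODALS` list, the function names, and the assumption that token strings are unique within a sentence are all illustrative choices, not details from the thesis.

```python
from collections import deque

# Hypothetical modal list; the thesis's own definition is not given in the abstract.
MODALS = ("can", "could", "may", "might", "must", "should", "will", "would")


def shortest_path(edges, start, goal):
    """Breadth-first search over the dependency graph, treated as undirected."""
    neighbours = {}
    for head, _rel, dep in edges:
        neighbours.setdefault(head, []).append(dep)
        neighbours.setdefault(dep, []).append(head)
    previous = {start: None}
    queue = deque([start])
    while queue:
        token = queue.popleft()
        if token == goal:
            path = []
            while token is not None:
                path.append(token)
                token = previous[token]
            return path[::-1]
        for nxt in neighbours.get(token, []):
            if nxt not in previous:
                previous[nxt] = token
                queue.append(nxt)
    return []


def drop_appositives(path, edges):
    """Hypothetical rule: remove the non-protein member of an appos pair on the path."""
    if not path:
        return path
    endpoints = {path[0], path[-1]}
    steps = [{a, b} for a, b in zip(path, path[1:])]
    dropped = set()
    for head, rel, dep in edges:
        if rel == "appos" and {head, dep} in steps:
            dropped.add(head if dep in endpoints else dep)
    return [t for t in path if t in endpoints or t not in dropped]


def add_modal_governed_verbs(path, edges, modals=MODALS):
    """Hypothetical rule: append off-path verbs that take a modal auxiliary
    and are directly attached to a token on the path."""
    on_path = set(path)
    linked = set()
    for head, _rel, dep in edges:
        if head in on_path:
            linked.add(dep)
        if dep in on_path:
            linked.add(head)
    extra = []
    for head, rel, dep in edges:
        if (rel == "aux" and dep.lower() in modals and head in linked
                and head not in on_path and head not in extra):
            extra.append(head)
    return path + extra


# "The receptor, PROT1, binds PROT2." (hypothetical parse)
edges_a = [("binds", "nsubj", "receptor"),
           ("receptor", "appos", "PROT1"),
           ("binds", "dobj", "PROT2")]
path_a = drop_appositives(shortest_path(edges_a, "PROT1", "PROT2"), edges_a)

# "PROT1 and PROT2 can interact." (hypothetical parse)
edges_b = [("PROT1", "conj", "PROT2"),
           ("interact", "nsubj", "PROT1"),
           ("interact", "aux", "can")]
path_b = add_modal_governed_verbs(shortest_path(edges_b, "PROT1", "PROT2"), edges_b)

print(path_a, path_b)
```

In a full pipeline the resulting token list would decide which nodes of the constituent parse tree are kept, and the pruned tree would then be passed to the tree kernel.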
Keywords/Search Tags: protein-protein interaction extraction, tree kernel, appositive dependency relation, modal verb phrase