Research And Construction Of Protein Named Entity Recognition System

Posted on: 2006-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: W X Wang
GTID: 2178360182483351
Subject: Computer Science and Technology
Abstract/Summary:
To understand biological processes, we must clarify how biomedical substances, namely proteins, interact with each other. Unfortunately, an enormous and still growing amount of this information is buried in millions of scientific publications. Recovering it is a challenging task in bioinformatics that requires automatic, efficient processing with intelligent information extraction.

Since last year (2004), our laboratory has been running a bioinformatics research project on protein-protein interaction (SPIES). SPIES, a mature system for protein-protein interactions, captures interactions between proteins only through automatically generated patterns. Hence, a supporting system for protein named entity recognition is urgently needed.

We present a Named Entity recognition system for Protein names (Ne4Pro), an automatic system that extracts protein names from the biological literature and links them to the associated entries in a sequence database.

The extraction system in Ne4Pro is divided into several tasks. The first task is named entity identification, which separates the name and non-name parts of biomedical text. The second task is named entity boundary fixation, for which we provide an expansion and shrinkage method that captures the correct boundaries of named entities. The identified named entities are then passed to the last task, semantic classification, where they are classified into protein and non-protein name classes. Each task is carefully designed using expert knowledge of the nomenclature of protein names, making Ne4Pro a novel system that integrates dictionary-based, rule-based, and machine learning-based methods.

The main contributions of this thesis are as follows. First, the last task of our system, a novel semantic classification that integrates knowledge-based, dictionary-based, and machine learning-based methods, achieves a large performance improvement over an independent baseline semantic classification. This task shows that a knowledge-based model and careful curation of the dictionary are important for reducing the ambiguity of the classifier. The second contribution is a novel boundary fixation method, which is considered more accurate than simply detecting the longest named entities. The third contribution is a new word shape feature, proposed to overcome the limitations of dictionaries by imitating the word shapes of their entries. In addition, a statistically information-rich unigram/bigram feature with a rule-based class smoothing method is introduced. Last but not least, the design of each task plays an important role in the performance of the system.

We use GENIA corpus 3.02 to conduct 10-fold cross-validation experiments. To achieve the desired performance, we use SVM as the machine learning approach, since it has shown the best performance in various biomedical natural language processing tasks. Unlike most previously developed systems, we do not evaluate against longest named entity annotations, because we want to find more precise named entities rather than only the longest ones. Our experiments show that the proposed boundary fixation task improves the performance of the system by 6.7% in precision and 9.3% in recall, while our semantic classification task outperforms both the baseline semantic classification and other similar systems.
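
The abstract does not define the word shape feature in detail. As a rough illustration only, the sketch below uses a common formulation from named entity recognition, not necessarily the one in Ne4Pro: uppercase letters map to `A`, lowercase letters to `a`, digits to `0`, and runs of the same symbol are optionally collapsed. The function name `word_shape` and the `collapse` parameter are hypothetical.

```python
import re


def word_shape(token: str, collapse: bool = True) -> str:
    """Return a coarse shape for a token (assumed scheme, not the thesis's exact one).

    Uppercase -> 'A', lowercase -> 'a', digit -> '0'; other characters are kept.
    """
    symbols = []
    for ch in token:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("0")
        else:
            symbols.append(ch)  # keep hyphens, slashes, and similar characters
    shape = "".join(symbols)
    if collapse:
        # Collapse runs of the same symbol so that names of different
        # lengths can share one shape.
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


if __name__ == "__main__":
    for name in ["IL-2", "p53", "NF-kappaB", "Interleukin"]:
        print(name, word_shape(name))
```

With a feature like this, a name that is missing from the dictionary can still resemble known entries: under the assumed scheme, `IL-2` and a hypothetical unseen token such as `TNF-4` both reduce to the shape `A-0`.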
Keywords/Search Tags: Information Extraction, Protein Names, Boundary Fixation, Word Shape Features, Integrated Strategies