Coreference Resolution In Biomedical Texts

Posted on: 2015-05-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2298330467487075
Subject: Computer application technology
Abstract/Summary:
The biomedical literature is growing exponentially. These texts contain abundant knowledge that matters for research, teaching, and practice in the biomedical field, for the diagnosis, prevention, and treatment of disease, and for the development of new drugs. Extracting valuable information efficiently from this mass of literature has therefore attracted increasing attention. Coreference resolution is the basis for obtaining this information and strongly influences the performance of information extraction. This thesis studies coreference in biomedical texts. The system framework consists of two steps.

The first step extracts candidate anaphors and antecedents from the development data. The performance of this step plays an important role in coreference resolution. The thesis applies different rules to pronoun anaphors and to noun phrase anaphors. For pronoun anaphors, a pronoun list is first created, and every word in the list is extracted as a candidate anaphor. Spurious instances of "it" and "that" are then pruned, using the output of the Enju parser and rules, respectively. This pruning improves precision significantly. For noun phrase anaphors, rules are used both to extract and to prune the candidates.

The second step resolves the coreference of each anaphor. Two methods are adopted: a single machine learning method and a hybrid method.

In the single machine learning method, feature selection is conducted separately for pronoun coreference resolution and for noun phrase coreference resolution, according to the characteristics of the biomedical corpus, instead of reusing general-domain features. The method achieves an F-score of 49.36% over all anaphors, 10.06% higher than the existing machine learning method that uses general-domain features. This shows that a single machine learning method with feature selection tailored to each anaphor type is effective.

To improve performance further, the hybrid method applies a different method to each anaphor type (relative pronouns, non-relative pronouns, and noun phrases). For relative pronouns, a composite method that combines machine learning with rules is adopted. For non-relative pronouns, dividing them into overly fine-grained subtypes causes two problems: data sparseness in the machine learning method for demonstrative and indefinite pronouns, and neglect of lexical information in the kernel-based machine learning method for personal pronouns. To overcome both, the thesis adopts a uniform rule-based method for all non-relative pronouns. For noun phrases, a rule-based method is adopted. On the BioNLP Shared Task 2011 development data, the result for non-pronoun resolution improves significantly, and the overall result improves by 1.21% over the current best coreference resolution system. This shows that the hybrid method is effective.

Of the two methods, the hybrid method achieves much better performance, while the single machine learning method is much more robust. Overall, both biomedical coreference resolution systems are effective and improve performance.
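The pronoun part of the first step (extract every word on a pronoun list, then prune spurious "it" and "that") can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not give the actual pronoun list or pruning rules, and the thesis prunes "it" using Enju parser output, which this sketch replaces with a simple surface heuristic. The names `PRONOUNS`, `COPULAS`, `REPORTING_WORDS`, and `extract_pronoun_candidates` are hypothetical.

```python
import re

# Hypothetical pronoun list; the thesis's real list is not given in the abstract.
PRONOUNS = {"it", "its", "they", "them", "their", "this", "that",
            "these", "those", "which", "who", "whose"}

# Hypothetical cue words for the stand-in pruning heuristics below.
COPULAS = {"is", "was", "seems", "appears"}
REPORTING_WORDS = {"suggest", "suggests", "suggested", "show", "shows",
                   "showed", "shown", "indicate", "indicates", "known"}


def tokenize(text):
    """Split text into alphabetic words and single non-letter symbols."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def is_pleonastic_it(tokens, i):
    """Stand-in for parser-based pruning: 'it' + copula + ... 'that' nearby."""
    window = [t.lower() for t in tokens[i + 1:i + 6]]
    return bool(window) and window[0] in COPULAS and "that" in window


def is_complementizer_that(tokens, i):
    """Illustrative rule: 'that' right after a reporting word is not an anaphor."""
    return i > 0 and tokens[i - 1].lower() in REPORTING_WORDS


def extract_pronoun_candidates(text):
    """Return (token index, token) pairs for candidate pronoun anaphors."""
    tokens = tokenize(text)
    candidates = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word not in PRONOUNS:
            continue  # not on the pronoun list
        if word == "it" and is_pleonastic_it(tokens, i):
            continue  # pruned as a spurious "it"
        if word == "that" and is_complementizer_that(tokens, i):
            continue  # pruned as a spurious "that"
        candidates.append((i, tok))
    return candidates


if __name__ == "__main__":
    sample = ("The results suggest that IL-2 activates the receptor, "
              "which regulates its targets. It is known that they bind DNA.")
    print(extract_pronoun_candidates(sample))
```

In this sketch, list lookup provides recall and the two pruning checks provide precision, which mirrors the extract-then-prune order described above. A real system would replace the surface heuristics with syntactic evidence from a parser.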
Keywords/Search Tags: Coreference Resolution, Biomedical Texts, Machine Learning, Rule-Based Methods, Integration of a Variety of Methods