Automatic Knowledge Acquisition For Word Sense Disambiguation

Posted on: 2011-05-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Che
Full Text: PDF
GTID: 1118360332457069
Subject: Computer application technology

Abstract/Summary:
With the widespread use of the Internet, knowledge expressed in natural language, in forums, blogs, and similar sources, is growing at tremendous speed. Mining and exploiting this resource is a challenging task for natural language processing (NLP) because ambiguity is pervasive in natural language. Word sense disambiguation (WSD) is therefore critical to NLP applications such as machine translation, information retrieval, and information extraction. However, wide application of WSD suffers from the knowledge acquisition bottleneck: WSD approaches may disambiguate incorrectly, or fail to disambiguate at all, because disambiguation knowledge is lacking or difficult to extract. This bottleneck limits WSD performance and is the major obstacle to applying WSD.

To tackle this problem, we propose approaches that automatically acquire knowledge for WSD. This research was funded by the National 863 Program and the National Natural Science Foundation. Several approaches are proposed to acquire knowledge from different sources, and the acquired knowledge is integrated to relieve the knowledge acquisition bottleneck.

Our contributions are as follows:

(1) We propose a WSD method that combines several disambiguation strategies based on the sememe relations of HowNet. The method benefits from the part-whole and attribute-host relations, the most basic and important sememe relations in HowNet. In addition, the value-attribute relation, message structure, and semantic similarity are employed for disambiguation. Different strategies act on words of different parts of speech. By combining strategies with different characteristics, we make them complement one another and fully exploit the disambiguation knowledge embodied in HowNet to improve disambiguation accuracy.

(2) To improve the quality of the sense-tagged corpus retrieved automatically through equivalent pseudowords, we present an automatic sense-tagged corpus acquisition method that integrates two filters: one selects equivalent-pseudoword samples by context similarity, and the other selects sentence samples by co-occurrence frequency. Integrating the two filters yields a sense-tagged corpus of better quality.

(3) We propose a WSD approach that integrates automatic and manual tagging so as to relieve the limitations each has when applied independently. Using the manually tagged corpus, this integrative approach not only provides samples for ambiguous words that have no equivalent pseudowords, but also calculates the probability distribution over senses. Meanwhile, it employs equivalent pseudowords to automatically tag large amounts of text, compensating for the insufficiency of the manually tagged corpus. Because the two tagging methods complement each other, the proposed approach performs better in WSD.

We also present a WSD approach based on a semantic relationship graph, which integrates HowNet, raw corpus, and manually tagged corpus. In addition, we propose to integrate our WSD approach into text categorization and suggest a feature representation method that combines concepts and words.

All the proposed WSD methods are evaluated on the data of the Chinese lexical sample task of Senseval-3 (2004). Their performance is close to or better than that of the other approaches that participated in the evaluation task, which confirms the effectiveness of our methods.

Our first method exploits the semantic and structural relations among sememes in HowNet to perform disambiguation. The second automatically acquires a sense-tagged corpus using equivalent pseudowords, which effectively tackles the shortage of annotated corpora in corpus-based approaches. The third benefits from both the quantity of the automatically tagged corpus and the quality of the manually tagged corpus. The proposed methods improve the coverage and accuracy of WSD and are expected to promote WSD research in both theory and application.
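
The way the third method draws on two corpora can be illustrated with a minimal sketch. It assumes a naive Bayes classifier, which the abstract does not name: sense priors are estimated from the manually tagged samples only, while word statistics pool the manual and the automatically tagged samples. The function names `train` and `disambiguate`, the add-one smoothing, and the English toy data are hypothetical illustrations, not taken from the dissertation.

```python
import math
from collections import Counter, defaultdict


def train(manual, auto):
    """Build a toy model from two lists of (sense, context_words) pairs.

    Assumption: sense priors come only from the manually tagged samples,
    since samples retrieved through pseudowords need not reflect real
    sense frequencies. Word counts pool both corpora.
    """
    prior_counts = Counter(sense for sense, _ in manual)
    word_counts = defaultdict(Counter)
    for sense, words in manual + auto:
        word_counts[sense].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return prior_counts, word_counts, vocab


def disambiguate(context, model):
    """Return the sense with the highest naive Bayes log-score."""
    prior_counts, word_counts, vocab = model
    total = sum(prior_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, counts in word_counts.items():
        # Add-one smoothing keeps senses seen only in the automatic corpus alive.
        score = math.log((prior_counts[sense] + 1) / (total + len(word_counts)))
        denom = sum(counts.values()) + len(vocab)
        for w in context:
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy data for an ambiguous word with two hypothetical senses.
manual = [
    ("finance", ["deposit", "money"]),
    ("finance", ["loan", "interest"]),
    ("river", ["water", "shore"]),
]
auto = [
    ("finance", ["account", "money", "loan"]),
    ("river", ["fishing", "water", "boat"]),
    ("river", ["boat", "shore"]),
]

model = train(manual, auto)
print(disambiguate(["boat", "fishing"], model))  # river
```

In this toy run, "boat" and "fishing" never occur in the manual samples, so the correct sense is recoverable only because the automatically tagged samples widen the word statistics, while the prior still reflects the manual sense distribution.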

Keywords/Search Tags: Word sense disambiguation, Knowledge acquisition bottleneck, HowNet, Equivalent pseudoword, Tagged corpus