Word Sense Disambiguation Corpus Automatic Acquisition

Posted on:2009-03-01

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Guo

Full Text:PDF

GTID:2178360278964516

Subject:Computer Science and Technology

Abstract/Summary:

Polysemy, where one word has several senses, causes many difficulties for the processing of natural language by computer. Many problems in natural language understanding ultimately come down to resolving ambiguous terms. More than 60 years have passed since the impact of this issue was first noted, and during that time researchers have proposed a number of approaches to word sense disambiguation (WSD). With the development of large-scale text-processing technology, supervised machine learning methods have come to dominate WSD because of their high accuracy. However, the success of these methods depends heavily on having enough training data. Annotating such data is time-consuming and laborious, and consistency is difficult to guarantee. The data sparseness caused by the lack of training data limits the wider adoption of supervised methods. Some studies have therefore set out to obtain training corpora automatically. Among them, a method that uses synonyms to expand the training corpus has lower resource costs and better extensibility. However, experiments show that the corpus obtained by this method contains too much noise and is highly biased. Therefore, focusing on how to obtain effective training corpora automatically, this thesis proposes a two-stage expansion-verification strategy, in which the verification stage eliminates the noise that the expansion stage introduces into the training corpus. We focus on the verification capabilities of two approaches, one based on a language model and the other on pointwise mutual information. To provide a baseline for the subsequent experiments, an SVM-based supervised WSD system was developed. Experiments on the Semeval-2007 English lexical sample corpus show that the linear-kernel SVM performs best.
Next, we use the synonyms of the target words in the Senseval-3 Chinese corpus and the Semeval-2007 English corpus to obtain candidate WSD corpora from the Web and from raw corpora. We then filter these candidates using the language model and pointwise mutual information approaches, and add the resulting expansion corpora to the respective supervised systems. The results show that both approaches are able to verify the candidates and improve the final performance of the system. The language model approach improves the accuracy of the system on the Senseval-3 Chinese lexical sample corpus from 62% to 63.06%. Evaluation on the Semeval-2007 English lexical sample corpus shows that the pointwise mutual information verification approach improves accuracy from 88.19% to 88.46%.
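
The abstract does not say how pointwise mutual information is used to score candidates, so the following is only a minimal sketch of one plausible verification step, not the thesis's actual procedure. It assumes sentence-level co-occurrence counts taken from a small reference corpus, scores a candidate by the average PMI between the target word and its context words, treats unseen pairs as contributing 0, and keeps candidates above a threshold. The function names, the toy data, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def count_cooccurrences(sentences):
    """Count, per sentence, which words and which unordered word pairs occur."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))
        word_counts.update(unique)
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                pair_counts[(a, b)] += 1
    return word_counts, pair_counts, len(sentences)


def pmi(a, b, word_counts, pair_counts, n):
    """PMI(a, b) = log2(P(a, b) / (P(a) * P(b))) from sentence-level counts.

    Assumption: a pair never seen together contributes 0.0 (treated as
    independent) instead of negative infinity.
    """
    key = (a, b) if a < b else (b, a)
    joint = pair_counts[key]
    if joint == 0:
        return 0.0
    return math.log2(joint * n / (word_counts[a] * word_counts[b]))


def verify(target, context, stats, threshold):
    """Keep a candidate sentence if its average target-context PMI is high enough."""
    if not context:
        return False
    word_counts, pair_counts, n = stats
    score = sum(pmi(target, w, word_counts, pair_counts, n) for w in context)
    return score / len(context) >= threshold


# Toy reference corpus (hypothetical data, for illustration only).
REFERENCE = [
    ["bank", "loan", "money"],
    ["bank", "money", "account"],
    ["river", "water", "flow"],
    ["river", "water", "fish"],
]
STATS = count_cooccurrences(REFERENCE)
```

With this toy data, `verify("bank", ["money", "loan"], STATS, 0.5)` accepts the candidate, while `verify("bank", ["water", "fish"], STATS, 0.5)` rejects it because the target never co-occurs with those context words. In practice the counts would come from a much larger corpus and the threshold would need tuning on held-out data.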

Keywords/Search Tags:

Natural language processing, Word sense disambiguation, Language model, Pointwise mutual information

Related items

1	Research On Word Sense Disambiguation Based On The Strategy Of Field Priority Selection
2	Research On Word-level Ambiguity Resolution Method
3	Based On Semi-supervised Method Of Chinese Word Sense Disambiguation
4	Chinese Word Sense Disambiguation Based On Parsing Tree
5	Research And Application Of Word Sense Disambiguation Method Based On Contextual Semantic
6	Research On Chinese Word Sense Disambiguation Model Based On Bidirectional Recurrent Neural Network
7	Word Sense Disambiguation Technology Research Based On Hownet And Bayesian Model
8	Chinese Word Sense Disambiguation Based On Semantic
9	Research On Word Sense Disambiguation Method Based On Word Embedding
10	An Approach For Word Sense Disambiguation Based On WordNet