Context Computing Applications, Word Disambiguation

Posted on:2007-11-09

Degree:Master

Type:Thesis

Country:China

Candidate:L Duan

GTID:2208360182461577

Subject:Software engineering

Abstract/Summary:

In this paper, we review the development of automatic word segmentation and word tagging in China and abroad. Although this work involves many difficulties, researchers have made progress. We also briefly review the development of corpus research and its great impact, both theoretical and methodological, on various language-related areas of research. We then introduce in detail the development and current state of word segmentation disambiguation and multi-category word disambiguation, two important topics in automatic word segmentation and word tagging.

The goal of the paper is to find a better way to disambiguate words using a corpus. The paper optimizes the context computation model, which rests on three essential assumptions about contexts (stability, particularity, and computability), and implements software to handle linguistic ambiguities that are widespread in language but can be resolved in context.

Information extracted from the corpus is stored in a database, which offers convenient data management and a high degree of reuse. The context computation model is applied to three challenging tasks: disambiguation of multi-category words, disambiguation of crossing word segmentation ambiguity, and disambiguation of covering word segmentation ambiguity. The experiments are based on a corpus of six years of the People's Daily, totaling 137,560,000 words and 345,000 distinct words.

The results show that the rate of correctly disambiguating covering word segmentation ambiguity is over 99% in the closed test and 87.84% in the open test. In the test of crossing word segmentation disambiguation, the accuracy is over 94%, and the average error rate is 25% lower than that of the ICTCLAS system developed by the Institute of Computing Technology, Chinese Academy of Sciences. In multi-category word disambiguation, the accuracy is 95.25% in the closed test and 95.21% in the open test, which is 23.95% higher than that of the ICTCLAS system. We also optimized the disambiguation procedure according to phraseological restriction rules. The new procedure reaches an accuracy of nearly 97.9% on examples with regular word collocations, which is 26.6% higher than that of the ICTCLAS system.
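
The abstract does not give the details of the context computation model, so the following is only a minimal sketch of the general idea of corpus-based disambiguation of a multi-category word: count which neighbouring words each (word, tag) pair co-occurs with in a tagged corpus, then choose the tag whose recorded contexts best match the observed neighbours. The function names `train` and `disambiguate`, the one-word context window, the add-one smoothing, and the toy English corpus are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count, for each (word, tag), how often each neighbouring word occurs."""
    context_counts = defaultdict(Counter)  # (word, tag) -> Counter of neighbours
    tag_counts = Counter()                 # (word, tag) -> number of occurrences
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        for i, (word, tag) in enumerate(sentence):
            tag_counts[(word, tag)] += 1
            for j in (i - 1, i + 1):       # assumed window: one word each side
                if 0 <= j < len(words):
                    context_counts[(word, tag)][words[j]] += 1
    return context_counts, tag_counts


def disambiguate(word, left, right, context_counts, tag_counts):
    """Return the tag whose stored contexts best match (left, right).

    Returns None if the word never appeared in the training corpus.
    """
    best_tag, best_score = None, -1.0
    for (w, tag), total in tag_counts.items():
        if w != word:
            continue
        seen = context_counts[(w, tag)]
        # Relative frequency of the observed neighbours among this tag's
        # contexts, with add-one smoothing (an illustrative choice).
        score = (seen[left] + seen[right] + 1) / (2 * total + 1)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical toy corpus: "book" is used as both a noun and a verb.
TOY_CORPUS = [
    [("I", "PRON"), ("book", "VERB"), ("flights", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("a", "DET"), ("book", "NOUN"), ("arrived", "VERB")],
]

if __name__ == "__main__":
    contexts, tags = train(TOY_CORPUS)
    print(disambiguate("book", "the", "is", contexts, tags))
    print(disambiguate("book", "I", "tickets", contexts, tags))
```

In a setting like the one the abstract describes, the counts would presumably come from a large corpus and be stored in a database rather than in memory; the in-memory dictionaries here only keep the sketch self-contained.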

Keywords/Search Tags:

corpus linguistics, linguistic model, relative word frequency, Chinese word segmentation disambiguation, multi-categories word disambiguation

Related items

1	Research On Statistical Method Of Chinese Word Meaning Disambiguation Based On Multi-Classifier
2	The Research Of Chinese Word Segmentation Disambiguation Based On Word Environment Information
3	Research On Word Sense Disambiguation Based On DBN
4	Research On The Technology Of Disambiguation For Chinese Word Segmentation
5	Research On Automatic Disambiguation Method Of Tibetan Word Meaning Based On Chinese And Tibetan Parallel Corpus
6	Research Of Similarity Based On Relative Word Frequency
7	The Research On Chinese Word Sense Disambiguation Based On Corpus
8	Research On Chinese Word Sense Disambiguation Model Based On Bidirectional Recurrent Neural Network
9	Research On Chinese Word Sense Disambiguation Method Based On Graph Model
10	Chinese Word Sense Disambiguation Based On Moses