Context Computing Applications, Word Disambiguation

Posted on: 2007-11-09
Degree: Master
Type: Thesis
Country: China
Candidate: L Duan
GTID: 2208360182461577
Subject: Software engineering
Abstract/Summary:
In this paper, we review the development of automatic word segmentation and word tagging in China and abroad. Although this work faces many difficulties, researchers have made considerable progress. We also briefly review the development of corpus research and its strong impact, both theoretical and methodological, on various language-related areas of research. We then describe in detail the development and present state of word segmentation disambiguation and multi-category word disambiguation, two important topics in automatic word segmentation and word tagging.

The goal of the paper is to find a better way to disambiguate words using a corpus. The paper optimizes a model of context computation based on three essential assumptions about contexts: stability, particularity, and computability. It also implements software that handles linguistic ambiguities which are widespread in language but can be resolved in context. Information extracted from the corpus is stored in a database, which offers convenient data management and a high degree of reuse.

The context computation model is applied to three challenging tasks: disambiguation of multi-category words, disambiguation of crossing word segmentation ambiguity, and disambiguation of covering word segmentation ambiguity. The experiments are based on a corpus of six years of the People's Daily, totaling 137,560,000 words and 345,000 distinct words.

The results show that the rate of correctly disambiguating covering word segmentation ambiguity is over 99% in the closed test and 87.84% in the open test. In the test of crossing word segmentation disambiguation, the correct rate is over 94%, and the average error rate is 25% lower than that of the ICTCLAS system developed by the Institute of Computing Technology, Chinese Academy of Sciences. In multi-category word disambiguation, the accuracy is 95.25% in the closed test and 95.21% in the open test, which is 23.95% higher than that of the ICTCLAS system.

We have further optimized the disambiguation procedure according to phraseological restriction rules. The new procedure achieves an accuracy of nearly 97.9% on examples that have disciplinary word collocations, which is 26.6% higher than that of the ICTCLAS system.
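The abstract does not give the formulas of the context computation model, so the following is only a minimal sketch of the general idea of resolving a covering ambiguity from context statistics. Everything in it is an assumption made for illustration: the toy training triples, the function names `build_context_table`, `score`, and `disambiguate`, and the choice of a smoothed product of relative frequencies as the score. It should not be read as the thesis's actual method.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (left neighbour, reading, right neighbour).
# "才能" is read either as one word or as two words ("才|能").
# A real system would extract such triples from a large annotated corpus.
TRAINING = [
    ("有", "才能", "的"),
    ("的", "才能", "和"),
    ("这样", "才|能", "完成"),
    ("只有", "才|能", "成功"),
    ("努力", "才|能", "完成"),
]


def build_context_table(samples):
    """Count how often each context word occurs beside each reading."""
    table = defaultdict(Counter)
    for left, reading, right in samples:
        table[reading][("L", left)] += 1
        table[reading][("R", right)] += 1
    return table


def score(table, reading, left, right, smoothing=0.1):
    """Product of smoothed relative frequencies of the two context words.

    The additive smoothing is an illustrative choice, not from the source.
    """
    counts = table[reading]
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen context words
    result = 1.0
    for key in (("L", left), ("R", right)):
        result *= (counts[key] + smoothing) / (total + smoothing * vocab)
    return result


def disambiguate(table, left, right):
    """Return the reading whose observed contexts best match the input."""
    return max(list(table), key=lambda r: score(table, r, left, right))


table = build_context_table(TRAINING)
print(disambiguate(table, "只有", "完成"))
print(disambiguate(table, "有", "的"))
```

In this sketch the counts play the role of the information extracted from the corpus and stored for reuse; a database table keyed by reading and context word could replace the in-memory dictionary without changing the scoring logic.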
Keywords/Search Tags: corpus linguistics, linguistic model, relative word frequency, Chinese word segmentation disambiguation, multi-category word disambiguation