Chinese Word Sense Disambiguation With AdaBoost.MH Algorithm

Posted on: 2007-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: F C Liu
Full Text: PDF
GTID: 2178360182960718
Subject: Computer software and theory
Abstract/Summary:
Word sense disambiguation (WSD) plays an important role in many areas of natural language processing, such as machine translation, information retrieval, sentence analysis, and speech recognition. Research on WSD therefore has great theoretical and practical significance. This dissertation studies supervised learning algorithms that acquire WSD knowledge from multiple knowledge sources, using a large sense-tagged Chinese corpus.

An approach to Chinese word sense disambiguation based on the supervised AdaBoost.MH learning algorithm is presented. AdaBoost.MH learns WSD knowledge from multiple knowledge sources: it boosts the accuracy of weak decision-stump rules by repeatedly calling a weak learner and combining the resulting rules into a more accurate final rule. A simple stopping criterion is also presented, motivated by the efficiency of learning and the practical utility of the system.

In a comparative experiment against the Naive Bayes algorithm, AdaBoost.MH shows a higher learning capability: in open tests on the SENSEVAL-3 Chinese corpus, its accuracy is 8 percentage points higher than that of Naive Bayes.

To extract more contextual information for Chinese WSD, this work introduces a new knowledge source, semantic categorization, alongside two classical knowledge sources: the part of speech of neighboring words and local collocations. Experimental results show that semantic categorization knowledge improves both the learning efficiency of the algorithm and the accuracy of disambiguation. In open tests, AdaBoost.MH reaches a disambiguation accuracy of 85.75% for 6 typical polysemous Chinese words and 75.84% for 20 polysemous words from the SENSEVAL-3 Chinese corpus.

Given the complexity of building a broad-coverage semantically annotated corpus, an approach that uses WWW search engines to obtain an annotated corpus for Chinese WSD automatically is also presented. Experimental results show that the approach is feasible.
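The boosting procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes binary context features, real-valued decision stumps in the style of Schapire and Singer's AdaBoost.MH, and a fixed number of rounds in place of the stopping criterion, which the abstract does not specify. The names `train_adaboost_mh`, `predict`, `TOY` and `SENSES`, and the toy feature strings, are hypothetical.

```python
import math

def train_adaboost_mh(examples, labels, rounds=10):
    """examples: list of (set of binary context features, correct sense)."""
    m, k = len(examples), len(labels)
    eps = 1.0 / (m * k)  # smoothing term (assumed value) to avoid log(0)
    features = sorted(set().union(*(feats for feats, _ in examples)))
    # D[i][j]: weight on the pair (example i, label j), initially uniform.
    D = [[1.0 / (m * k)] * k for _ in range(m)]
    # Y[i][j] = +1 if label j is the correct sense of example i, else -1.
    Y = [[1 if sense == lab else -1 for lab in labels] for _, sense in examples]
    model = []
    for _ in range(rounds):
        best = None
        for feat in features:
            # W[b][j] = [W_minus, W_plus]; block b is 1 if feat is present.
            W = [[[0.0, 0.0] for _ in range(k)] for _ in range(2)]
            for i, (feats, _) in enumerate(examples):
                b = 1 if feat in feats else 0
                for j in range(k):
                    W[b][j][(Y[i][j] + 1) // 2] += D[i][j]
            # Z is the normalizer the weak learner tries to minimize.
            Z = 2 * sum(math.sqrt(wm * wp) for blk in W for wm, wp in blk)
            if best is None or Z < best[0]:
                c = [[0.5 * math.log((wp + eps) / (wm + eps)) for wm, wp in blk]
                     for blk in W]
                best = (Z, feat, c)
        _, feat, c = best
        model.append((feat, c))
        # Reweight: pairs the stump gets wrong gain weight, then renormalize.
        total = 0.0
        for i, (feats, _) in enumerate(examples):
            b = 1 if feat in feats else 0
            for j in range(k):
                D[i][j] *= math.exp(-Y[i][j] * c[b][j])
                total += D[i][j]
        for row in D:
            for j in range(k):
                row[j] /= total
    return model

def predict(model, labels, feats):
    """Sum the stump votes per label and return the highest-scoring sense."""
    scores = [0.0] * len(labels)
    for feat, c in model:
        b = 1 if feat in feats else 0
        for j in range(len(labels)):
            scores[j] += c[b][j]
    return labels[max(range(len(labels)), key=scores.__getitem__)]

# Toy data (hypothetical): a neighboring word plus the POS of the previous word.
SENSES = ["finance", "river"]
TOY = [
    ({"w=money", "pos-1=V"}, "finance"),
    ({"w=loan", "pos-1=N"}, "finance"),
    ({"w=money", "pos-1=N"}, "finance"),
    ({"w=water", "pos-1=V"}, "river"),
    ({"w=water", "pos-1=N"}, "river"),
    ({"w=shore", "pos-1=V"}, "river"),
]
model = train_adaboost_mh(TOY, SENSES, rounds=10)
```

Calling `predict(model, SENSES, {"w=water"})` scores an unseen context against both senses. In this sketch each knowledge source named in the abstract (part of speech of neighboring words, local collocations, semantic categories) would simply contribute more binary feature strings to each example's feature set.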
Keywords/Search Tags:Natural Language Processing, Word Sense Disambiguation, AdaBoost.MH Algorithm, Multiple Knowledge Sources