
Research On Word Sense Discrimination Based On Statistical Learning

Posted on: 2012-05-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D M Fan
Full Text: PDF
GTID: 1118330368482927
Subject: Computer application technology
Abstract/Summary:
Resolving ambiguity in language has long troubled researchers in language processing. Polysemy is regarded as the most important manifestation of language ambiguity. Word sense discrimination (WSD) addresses how to identify the correct meaning of an ambiguous word from the context in which it appears. In natural language understanding, WSD is a basic research problem with direct applications, and it is also one of the most important and difficult problems in the field.

Traditionally, WSD was studied mostly with rule-based methods. In recent years, with improvements in computing and storage technology, statistical learning methods have grown increasingly popular and have quickly become the mainstream approach. Supervised learning methods achieve high accuracy on WSD, but they require a sufficiently large set of training samples, which is not easy to obtain. Unsupervised methods do not need manually tagged training samples, but their discrimination performance is comparatively poor.

This dissertation analyzes several key issues that statistical word sense discrimination must solve. They are discussed one by one: the construction of dictionary and corpus resources, the modeling of the WSD problem, and feature selection for semantic classification. Building on these issues, the dissertation finally develops the idea of word-sense class extending and discusses how to apply it to statistical word sense discrimination.

The main contributions of this research are as follows:

1. The relationship between word sense discrimination and word sense characterization was discussed from the perspective of semantic computability. The dissertation also studied how to re-integrate existing dictionary resources and construct a new machine-readable dictionary with controlled semantic granularity, so as to serve word sense discrimination better. Experimental results showed that the semantic granularity used in the semantic characterization directly affected the accuracy of word sense discrimination, and that WSD accuracy could be increased by properly controlling the semantic granularity without introducing ambiguity. It was proposed to integrate the existing dictionary resources and build a new category dictionary for word sense discrimination.

2. A Bayesian model improved by information gain was proposed for feature selection in word sense discrimination. Word sense classifiers built with the Naive Bayes model, the Maximum Entropy method, and the Support Vector Machine (SVM) served as reference models in experiments that evaluated the effectiveness of the improved Bayesian model. The results showed that the Maximum Entropy and SVM classifiers were stronger than the Naive Bayes model, and that the SVM was the best of the reference models. The Bayesian model improved by information gain outperformed all of the reference models: its result was 1.4 percentage points higher than that of the SVM classifier, the best result in the comparative experiments.

3. The use of artificial ambiguous words was analyzed and tested experimentally from the perspective of the difficulty of constructing large-scale corpus resources. The concept of vicarious words and a new word sense discrimination method based on vicarious words were proposed. The results showed that artificial ambiguous words could help researchers relieve the shortage of training data. The vicarious-word technique, derived from artificial ambiguous words, allows an unsupervised WSD method that avoids manually tagged training samples. Experimental results showed that the WSD method based on vicarious words achieved high discrimination accuracy.

4. The idea of Word-sense Class Extending (WSE) and a new word sense discrimination method based on WSE were proposed for the case in which the training corpus is not sufficiently large. With WSE, the new method could obtain more semantic information from a limited training corpus, which enhanced training efficiency and improved word sense discrimination. In addition, WSE could gather statistics on related words from a raw corpus (with no prior knowledge such as semantic tags) in order to provide additional training samples. Experimental results showed that the WSE-based method made more efficient use of the training corpus and improved the effectiveness of WSD. WSE offers a new way to enhance the effectiveness of statistical learning on small-scale training corpora.

To sum up, this dissertation made some useful attempts in resource building, word sense discrimination modeling, and feature selection, as well as in improving supervised word sense discrimination, and achieved some initial results. As research on word sense discrimination continues, more new ideas and solutions will emerge.
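
As an illustration of the information-gain feature selection mentioned in contribution 2, the following is a minimal sketch only. The abstract does not describe how information gain is combined with the Bayesian model, so this shows the standard information-gain score for a binary "context word is present" feature, not the dissertation's method. The function names, the toy samples for the ambiguous word "bank", and the sense labels are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, feature):
    """Information gain of the test 'feature occurs in the context'.

    samples: list of (set_of_context_words, sense_label) pairs.
    """
    senses = [sense for _, sense in samples]
    present = [sense for ctx, sense in samples if feature in ctx]
    absent = [sense for ctx, sense in samples if feature not in ctx]
    n = len(samples)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(senses) - conditional

def select_features(samples, k):
    """Return the k context words with the highest information gain."""
    vocab = sorted({word for ctx, _ in samples for word in ctx})
    ranked = sorted(vocab, key=lambda w: information_gain(samples, w), reverse=True)
    return ranked[:k]

# Hypothetical toy contexts for the ambiguous word "bank".
samples = [
    ({"money", "deposit", "the"}, "finance"),
    ({"loan", "money", "the"}, "finance"),
    ({"river", "water", "the"}, "river"),
    ({"river", "fishing", "the"}, "river"),
]

top_features = select_features(samples, 2)
```

In this toy data, a word such as "the" appears with every sense and scores zero, while "money" and "river" separate the two senses completely. A classifier such as Naive Bayes could then be trained on the selected features only; how the dissertation actually integrates the scores is not stated in the abstract.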
Keywords/Search Tags: natural language processing, word sense discrimination, statistical learning, vicarious word, word-sense class extending