
Research On Key Technologies Of Word Sense Disambiguation Based On Statistical Learning

Posted on: 2015-07-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhou
Full Text: PDF
GTID: 1108330509961004
Subject: Computer Science and Technology
Abstract/Summary:
Ambiguity is one of the main obstacles to processing and understanding natural language text. This dissertation focuses on lexical ambiguity, i.e. automatic word sense disambiguation (WSD): selecting the proper sense of a word in a given context. WSD is regarded as one of the fundamental techniques in natural language processing and plays an important supporting role in machine translation, information retrieval, semantic parsing, and other tasks. Most current research disambiguates polysemous words individually, without regard to the dependencies among the polysemous words in a sentence. Taking a global-optimization view, this dissertation studies the application of structured machine learning to WSD and incorporates syntactic structure and graph structure into the learning models to further improve WSD performance. In addition, the shortage of sense-labeled data has long hampered WSD research: labeling polysemous words by hand is labor-intensive, and no language has enough sense-labeled data. Meanwhile, unlabeled data are abundant, especially on the Internet, and exploiting this cheap unlabeled data to improve WSD is an active research topic. This dissertation therefore also explores the use of topic features and of bootstrapping in WSD. The main contributions are as follows.

(1) All-words WSD (AW-WSD) disambiguates all open-class words (nouns, verbs, adjectives, and adverbs) in a given text. The senses assigned to the polysemous words of a sentence should be mutually consistent, but most current methods ignore this and disambiguate each word in isolation. This dissertation exploits the dependencies between the polysemous words in a sentence. First, AW-WSD is modeled with a hidden Markov model (HMM), turning it into a sequence labeling problem. Because an HMM-based method can use only simple observations, the model is then extended to a maximum entropy Markov model (MEMM), which integrates a large number of linguistic features. AW-WSD involves a very large state space, which causes data sparseness and high time complexity in HMM- and MEMM-based methods; these two problems are addressed by a smoothing strategy and a beam-search Viterbi algorithm, respectively. The proposed models are evaluated on the AW-WSD datasets of Senseval-2 and Senseval-3 (2004), where the MEMM-based method is comparable with the best result of the Senseval evaluations; a minimal sketch of the beam-search decoder follows.
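The abstract gives no implementation details, so the following is a minimal, hypothetical sketch of beam-search Viterbi decoding for an MEMM-style sense tagger as in (1): each word is scored only over its own candidate senses (at most R per word rather than the full state set of size N), and only the top-k partial paths survive, the same pruning idea that reduces the decoding cost in (2) from O(TN²) to O(TR²). The `score` function, `candidate_senses` inventory, and `beam_width` are illustrative assumptions, not the dissertation's actual components.

```python
import math

def beam_viterbi(words, candidate_senses, score, beam_width=5):
    """Beam-search Viterbi for MEMM-style all-words WSD (hypothetical sketch).

    words            -- tokens of the sentence
    candidate_senses -- dict: word -> list of its possible senses (size <= R)
    score            -- function(prev_sense, word, sense) -> log P(sense | prev_sense, word)
    Keeping only `beam_width` partial paths makes decoding
    O(T * beam_width * R) instead of O(T * N^2) over the full state space.
    """
    beam = [(0.0, [])]  # each entry: (log probability, sense path so far)
    for word in words:
        expanded = []
        for log_p, path in beam:
            prev = path[-1] if path else "<s>"
            for sense in candidate_senses.get(word, [word]):  # monosemous word: itself
                expanded.append((log_p + score(prev, word, sense), path + [sense]))
        # Prune to the highest-scoring partial paths (the beam).
        expanded.sort(key=lambda e: e[0], reverse=True)
        beam = expanded[:beam_width]
    return beam[0][1]  # best sense sequence found

# Toy usage with a uniform dummy scorer standing in for a trained MEMM.
if __name__ == "__main__":
    senses = {"bank": ["bank%finance", "bank%river"], "deposit": ["deposit%money"]}
    dummy = lambda prev, w, s: math.log(0.5) if w in senses else 0.0
    print(beam_viterbi(["I", "deposit", "cash", "at", "the", "bank"], senses, dummy))
```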
(2) The MEMM-based method normalizes the hidden states individually, which introduces label bias into the disambiguation. This dissertation employs conditional random fields (CRF) to overcome the label-bias problem. However, CRF training has very high time complexity and cannot directly handle a state space as large as that of WSD. The training time complexity of the CRF is reduced from O(mLTN²) to O(mLTR²) by approximate training and parallelization, where N is the number of states (ranging from tens of thousands to hundreds of thousands) and R is the maximum number of senses of a single word (a few dozen). Likewise, the decoding time complexity is reduced from O(TN²) to O(TR²) by beam search. The CRF-based method achieves a recall of 0.657 on the AW-WSD dataset of Senseval-3 (2004), exceeding the best system in that evaluation. In addition, to better exploit the information in dependency trees, the graph structure of the CRF is extended from a linear chain to a tree; the tree-structured CRF reaches a recall of 0.668 on the same dataset, demonstrating that syntactic information can improve WSD.

(3) Topic features are employed to improve WSD. The context of a target word within a single sentence is quite limited, and sense-labeled data are scarce, which causes serious data sparseness. As an unsupervised method, a topic model clusters the word space and improves generalization across words. This dissertation proposes a disambiguation method that integrates topic features, enhancing the classifier with latent Dirichlet allocation (LDA) topic features inferred from unlabeled data. The method achieves a recall of 0.68 on the AW-WSD dataset of Senseval-3 (2004), exceeding the best published record for that evaluation. The experiments also show that a suitable number of topics helps WSD, that the genre of the unlabeled corpus directly affects disambiguation performance, and that a large-scale balanced unlabeled corpus helps most (a sketch of this feature construction appears after (4) below).

(4) This dissertation also combines labeled and unlabeled data to improve disambiguation, a strategy usually called bootstrapping. The shortage of sense-labeled data and the abundance of unlabeled data motivate exploiting the latter. In bootstrapping, an initial classifier is built, the unlabeled instances are classified with it, and the automatically labeled instances are added to the training set; this process is iterated to enlarge the training set and thereby improve disambiguation. On the Chinese lexical-sample WSD dataset of Senseval-3 (2004), the influence of the number of iterations, the size of the instance pool, and the growth rate on disambiguation performance is explored. A sampling algorithm that maintains the category ratio is designed to avoid class imbalance during bootstrapping, and smoothing over multiple classifiers increases the probability of finding the optimal empirical parameters; a sketch of such a loop closes this summary.
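The abstract names the technique of (3) but not its toolchain, so the following is a minimal sketch of the general idea, assuming scikit-learn as a stand-in: an LDA model is fitted on unlabeled text, and the inferred topic distribution of each target word's context is appended to its ordinary local features. The corpus, topic count, and feature layout are illustrative assumptions, not the dissertation's actual setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in unlabeled corpus; the dissertation uses large-scale raw text instead.
unlabeled_docs = [
    "the bank approved the mortgage loan",
    "the river bank was muddy after the flood",
    "interest rates at the central bank rose",
]

# Fit LDA on the unlabeled data (the topic count is a tunable choice;
# the experiments in (3) show its value matters).
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(unlabeled_docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(bow)

def topic_features(context_sentence):
    """Infer a topic distribution for a target word's context sentence."""
    return lda.transform(vectorizer.transform([context_sentence]))[0]

# Append the topic features to ordinary local features of a WSD instance.
local_features = np.array([1.0, 0.0, 1.0])  # e.g. indicator features (illustrative)
instance = np.concatenate([local_features, topic_features("deposit cash at the bank")])
print(instance.shape)  # local features plus 10 topic proportions
```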
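Likewise, (4) only outlines the bootstrapping procedure, so this is a hedged sketch of one plausible reading: each iteration auto-labels the pool, then admits the most confident instances under per-sense quotas proportional to the seed class ratio, which is the category-ratio-maintaining idea. The base classifier, `growth_rate`, and confidence criterion are illustrative assumptions.

```python
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap(X_seed, y_seed, X_pool, iterations=5, growth_rate=20):
    """Bootstrapping sketch that preserves the seed class (sense) ratio."""
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(X_pool)
    seed_ratio = Counter(y_seed)           # target sense proportions
    total = sum(seed_ratio.values())
    clf = LogisticRegression(max_iter=1000)
    for _ in range(iterations):
        if not pool:
            break
        clf.fit(X_train, y_train)
        proba = clf.predict_proba(pool)    # confidence of automatic labels
        picked = set()
        for cls, count in seed_ratio.items():
            quota = max(1, round(growth_rate * count / total))
            col = list(clf.classes_).index(cls)
            # Most confident pool instances predicted as this sense.
            for i in np.argsort(-proba[:, col])[:quota]:
                if proba[i].argmax() == col:
                    picked.add(int(i))
        # Move selected instances from the pool into the training set
        # (descending index order keeps the remaining indices valid).
        for i in sorted(picked, reverse=True):
            X_train.append(pool.pop(i))
            y_train.append(clf.classes_[proba[i].argmax()])
    return clf.fit(X_train, y_train)
```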
Keywords/Search Tags: Word sense disambiguation, Statistical learning, Structured machine learning, Unlabeled corpus