
Chinese Word Sense Disambiguation Research And Implementation For The Full Text Annotation

Posted on: 2016-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Bian
GTID: 2308330464464456
Subject: Computer technology
Abstract/Summary:
Word Sense Disambiguation (WSD) is an important research topic in Natural Language Processing. Its results have a significant impact on machine translation, information retrieval, information extraction, text mining, and speech recognition, so WSD matters both in theoretical research and in practical applications. WSD tasks fall into two categories: the lexical sample task and the all-words task. This thesis focuses on the all-words task, that is, sense tagging every polysemous word that appears in a given text. To this end, the thesis completes the following work:

1. This thesis adopts the "Modern Chinese semantic dictionary" (SKCC) as its sense inventory. However, some of the sense-tagged corpora used in the experiments are annotated with the senses of the "Modern Chinese grammatical information dictionary" (GKB), and for some polysemous words the sense divisions in SKCC are inconsistent with those in GKB. To arrive at a more reasonable sense division for polysemous words in SKCC, the thesis maps senses between the two dictionaries and replaces the GKB labels in the tagged corpora with SKCC labels. To make this mapping convenient and efficient, we developed a dictionary mapping toolkit: when the same sense exists in both dictionaries, the toolkit establishes the mapping automatically, which greatly eases the mapping work. Based on the mapping results, we invited linguistics graduates to correct the entries of some polysemous words in SKCC, making the sense classification of those words more reasonable.

2. To disambiguate all polysemous words, this thesis proposes an active learning method based on the Relative Frequency Ratio (RFR). The method uses a large unlabeled corpus to compute the RFR of the words in the context of a target polysemous word; the RFR represents the collocation strength between the target word and a context word. Context words with high collocation strength are chosen as common collocations of the target word, and a human annotator confirms the sense that the target word takes with each collocation. Unlabeled instances can then be tagged in batches and used as training data. Eight randomly selected polysemous words were used in the disambiguation experiments. A supervised experiment on the existing labeled corpora reached an average precision of only 74.52%. On the same test corpus and with the same feature selection method, the RFR-based active learning method reached an average precision of 85.01%, which is 10.49 percentage points higher than the supervised method.

3. This thesis is supported by the National Natural Science Foundation of China (NSFC) project "Chinese word sense tagging key technologies". The project includes a subtask on semantic disambiguation of full text that builds on this research, so the thesis developed a WSD platform for full-text annotation. The platform is built on the Django framework and has four modules: word sense query, query of the distribution of training instances for polysemous words, manual annotation, and word sense disambiguation. The platform can apply the disambiguation model to tag word senses automatically in input text, and it also supports manual tagging of unlabeled corpora. Newly labeled data can be added to the training corpus to enlarge it and improve the accuracy of the underlying disambiguation model. As annotated instances accumulate and the models are updated, the platform's disambiguation models gain greater practical value.
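
The abstract does not give the exact RFR formula, so the following is only a minimal sketch of how an RFR-style collocation score and the batch tagging step might look. It assumes RFR is the relative frequency of a word in the target word's context windows divided by its relative frequency in the whole corpus. The function names, the `min_count` threshold, and the rule of skipping instances whose collocations point to different senses are all hypothetical choices made for illustration, not the thesis's implementation.

```python
from collections import Counter


def relative_frequency_ratio(contexts, background, min_count=2):
    """Score context words by an assumed RFR definition.

    contexts:   list of token lists, one window per occurrence of the target word
    background: list of tokens from the whole (unlabeled) corpus
    min_count:  hypothetical threshold that drops rare context words
    """
    ctx_counts = Counter(tok for window in contexts for tok in window)
    bg_counts = Counter(background)
    ctx_total = sum(ctx_counts.values())
    bg_total = sum(bg_counts.values())

    scores = {}
    for word, count in ctx_counts.items():
        if count < min_count or bg_counts[word] == 0:
            continue  # too rare, or unseen in the background corpus
        ctx_rel = count / ctx_total
        bg_rel = bg_counts[word] / bg_total
        scores[word] = ctx_rel / bg_rel
    return scores


def top_collocations(scores, k=10):
    """Return the k context words with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def batch_label(contexts, collocation_to_sense):
    """Tag every window whose collocations agree on exactly one sense.

    collocation_to_sense is the human-confirmed mapping from a
    collocation to the sense the target word takes with it.
    """
    labeled = []
    for window in contexts:
        senses = {collocation_to_sense[tok]
                  for tok in window if tok in collocation_to_sense}
        if len(senses) == 1:  # skip windows with no or conflicting evidence
            labeled.append((window, senses.pop()))
    return labeled
```

In this sketch the annotator only has to confirm one sense per high-scoring collocation, and `batch_label` then propagates that decision to every matching instance, which is where the savings over instance-by-instance annotation would come from.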
Keywords/Search Tags:WSD, dictionary mapping, RFR, active learning, WSD platform