Research On Automatic Disambiguation Method Of Tibetan Word Meaning Based On Chinese And Tibetan Parallel Corpus

Posted on: 2016-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: X M Jiang
GTID: 2208330470966827
Subject: Computer software and theory
Abstract/Summary:
Word sense disambiguation (WSD) is an important part of semantic analysis and a central problem in natural language processing; it supports machine translation and other high-level applications. Growing demand for Tibetan natural language applications makes WSD a key technology that needs to be developed, but given the current state of Tibetan information processing research, Tibetan WSD is still at an early stage.

At present, WSD methods fall into two classes: statistical methods and rule-based methods. Statistical methods include supervised and unsupervised approaches. Supervised methods need a corpus with detailed word sense tagging, and unsupervised methods need a massive untagged corpus. At the current stage of Tibetan language information processing, neither a sense-tagged corpus nor a massive untagged corpus is available, and building either is enormously difficult. Methods based on a semantic knowledge base derive from the rule-based approach; for English, the knowledge has been obtained with the help of machine learning methods. This is one of the most active directions in WSD research and has proved effective for English and Chinese word sense disambiguation.

For these reasons, this paper proposes two methods that select the right Chinese meaning for ambiguous Tibetan words in the Tibetan sentences of a Tibetan-Chinese parallel corpus, using semantic knowledge from HowNet and translation information from the parallel corpus. The work carried out in this paper is as follows:

1. Improving the word similarity calculation method: On the basis of traditional methods that rely on the semantic distance between sememes, we integrate complementary information, such as the lowest height of the common parent and the level height difference of the sememes, into a new calculation method. This paper also proposes a relevance calculation method based on the semantic roles in HowNet as auxiliary information.

2. Combining the Tibetan-Chinese parallel corpus with HowNet to study Tibetan word sense disambiguation: First, the corpus is preprocessed with word segmentation and POS tagging. Second, the Chinese meanings of all Tibetan words in the corpus are collected with the help of a Tibetan-Chinese dictionary, which determines which words are ambiguous. Third, the right Chinese meaning of each ambiguous word is selected by calculating the semantic similarity and relevance between the context and the Chinese meanings of the ambiguous word. On the selected corpus, this method achieves an average precision of 55.04% at the word level and 50.4% at the sentence level.

3. Using a network graph based on semantic knowledge to study Tibetan word sense disambiguation: To address the data sparseness between context and meaning that affects the former method, this method builds a semantic relation diagram for each Chinese meaning from the rich semantic information in HowNet. A selection parameter for each Chinese meaning is obtained by calculating the relevance between the context and the relationship items in the relation diagram, and the right Chinese meaning of the ambiguous Tibetan word is selected according to this parameter. In experiments, this method improves average accuracy over the former method by 3.7% at the sentence level and by 3.12% at the word level.
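The similarity calculation and meaning selection steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the thesis's actual formulas or data: the toy sememe hierarchy `PARENT`, the distance-based form `alpha / (distance + alpha)`, the way the depth of the lowest common parent and the level difference are folded in, and the function names. Real HowNet entries carry several sememes plus semantic roles, which are omitted here.

```python
# Hypothetical toy sememe hierarchy (child -> parent); not real HowNet data.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "inanimate": "thing",
    "artifact": "inanimate",
}


def path_to_root(sememe):
    """Return the chain [sememe, parent, ..., root]."""
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path


def depth(sememe):
    """Number of edges between the sememe and the root."""
    return len(path_to_root(sememe)) - 1


def lowest_common_parent(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError("sememes do not share a root")


def sememe_similarity(a, b, alpha=1.6):
    """Assumed combination: path distance, common-parent depth, level difference."""
    lcp = lowest_common_parent(a, b)
    distance = depth(a) + depth(b) - 2 * depth(lcp)
    base = alpha / (distance + alpha)              # distance-based term
    depth_factor = (depth(lcp) + 1) / (max(depth(a), depth(b)) + 1)
    level_factor = 1 / (1 + abs(depth(a) - depth(b)))
    return base * depth_factor * level_factor


def choose_meaning(candidates, context):
    """Pick the candidate meaning whose sememe is most similar to the context.

    candidates: dict mapping a candidate Chinese meaning to one sememe
    context:    non-empty list of sememes taken from the context words
    """
    if not context:
        raise ValueError("context must not be empty")

    def score(sememe):
        return sum(sememe_similarity(sememe, c) for c in context) / len(context)

    return max(candidates, key=lambda meaning: score(candidates[meaning]))


if __name__ == "__main__":
    candidates = {"meaning_A": "animal", "meaning_B": "artifact"}
    print(choose_meaning(candidates, ["human", "animate"]))
```

Averaging over the context is only one possible aggregation; a relation-diagram approach like the one in item 3 would instead score each meaning against the related items collected for it.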
Keywords/Search Tags: Tibetan information processing, word sense disambiguation, Tibetan-Chinese parallel corpus, HowNet, semantic relation diagram