Research on Unsupervised Word Translation Disambiguation Based on Automatic Knowledge Acquisition

Posted on: 2009-06-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Y Liu
GTID: 1118360272980508
Subject: Computer application technology
Abstract/Summary:
The Internet is now the dominant medium for information worldwide, and for most people an efficient search engine is the most important means of acquiring information from it. As information becomes increasingly international, differences between languages remain a serious barrier to understanding, which makes research on machine translation (MT) and cross-language information retrieval (CLIR) urgent. Many hard problems in this research remain unresolved. One is how to select the right target-language translation for an ambiguous source-language word, known as word translation disambiguation (WTD). WTD and its monolingual counterpart, word sense disambiguation (WSD), are important and difficult problems that are fundamental to natural language processing (NLP).

Based on a comparison and analysis of existing unsupervised disambiguation methods, this thesis argues that the key problems of unsupervised WTD/WSD lie in three areas: knowledge acquisition, data sparseness, and the construction of bilingual semantic resources. Focusing on knowledge acquisition and on overcoming data sparseness, the core problems of unsupervised WTD/WSD, the thesis introduces a series of new methods for unsupervised WTD knowledge acquisition. All of the methods are evaluated on international gold-standard semantic evaluation data sets, outperform the best comparable systems, and achieve state-of-the-art results. Another key problem in WTD, the automatic construction of a bilingual semantic dictionary, is also studied.

In detail, the thesis is organized as follows:

1. The thesis studies the automatic acquisition of a sense-tagged corpus, the direct construction of a disambiguation classifier from it, and the application of that classifier to WTD. On this basis, it introduces the concept of Equivalent PseudoTranslation (EPT) and a WTD method based on it. Finally, the method is compared with other unsupervised systems on the Senseval-2 English Lexical Sample task.

2. A fully unsupervised WTD method based on the indirect association (IA) between bilingual words and on Web mining is introduced, motivated by an investigation of a bilingual parallel corpus. Four methods of computing IA on the Web are designed, and three different decision strategies are used in the disambiguation process. Furthermore, the thesis treats the Web as a special semantic lexicon, so that the Web bilingual relatedness (WBR) between bilingual words can be defined and computed directly from the page counts returned by a Web search engine. After WBR is tested on a revised gold-standard data set and shown to be feasible, a fully unsupervised WTD method is built on it. Both the IA method and the WBR method are tested and compared on SemEval-2007 Task 5, the Multilingual Chinese-English Lexical Sample task.

3. An unsupervised WTD method based on an N-gram language model (LM) and Web mining is introduced, following the observation of various series of synonyms in different sentences. Based on the assumption "different sense, different N-gram pattern", the model disambiguates using LM knowledge rather than semantic knowledge. Tests on the same gold-standard set show that the proposed model outperforms all other comparable unsupervised systems. A detailed comparison of the LM-based and semantics-based models, and the upper bound of combining them, is also discussed.

4. A method is studied that automatically generates a WTD-oriented bilingual semantic dictionary from WordNet, HowNet, and a large-scale bilingual parallel corpus. The method mainly uses WordNet- and HowNet-based similarity between words to filter statistical noise during word alignment. The resulting bilingual semantic dictionary combines information from all three resources: WordNet, HowNet, and the bilingual parallel corpus.

In brief, a largely complete set of knowledge-acquisition-oriented WTD methods has been established, in particular the Web-based methods, which explore the hard NLP problem of WTD/WSD by means of Web search.
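As a rough illustration of the page-count idea described above, the sketch below scores each candidate translation by its average relatedness to the context words and picks the highest-scoring one. The abstract does not give the thesis's actual WBR definition or decision strategies, so the Jaccard-style formula, the function names, the example words, and the toy page counts are all assumptions made for illustration. A real system would obtain the counts from a search engine rather than a lookup table.

```python
from typing import Callable, Dict, Sequence, Tuple


def jaccard_relatedness(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """Jaccard-style relatedness from page counts (an assumed measure,
    not necessarily the WBR definition used in the thesis)."""
    union = hits_a + hits_b - hits_ab
    if hits_ab <= 0 or union <= 0:
        return 0.0
    return hits_ab / union


def choose_translation(
    context: Sequence[str],
    candidates: Sequence[str],
    page_count: Callable[..., int],
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate with the highest mean relatedness to the context."""
    scores: Dict[str, float] = {}
    for cand in candidates:
        total = 0.0
        for word in context:
            total += jaccard_relatedness(
                page_count(word), page_count(cand), page_count(word, cand)
            )
        scores[cand] = total / len(context) if context else 0.0
    best = max(candidates, key=lambda c: scores[c])
    return best, scores


# Hypothetical page counts standing in for search-engine results.
TOY_COUNTS = {
    frozenset(["花"]): 5000,
    frozenset(["开"]): 8000,
    frozenset(["cuckoo"]): 3000,
    frozenset(["azalea"]): 2000,
    frozenset(["花", "cuckoo"]): 50,
    frozenset(["花", "azalea"]): 600,
    frozenset(["开", "cuckoo"]): 40,
    frozenset(["开", "azalea"]): 300,
}


def toy_page_count(*terms: str) -> int:
    """Look up a made-up page count for one term or a pair of terms."""
    return TOY_COUNTS.get(frozenset(terms), 0)


# Ambiguous Chinese word "杜鹃" (cuckoo or azalea) in a flower-related context.
best, scores = choose_translation(["花", "开"], ["cuckoo", "azalea"], toy_page_count)
print(best, scores)
```

Averaging over context words is only one possible decision strategy; the thesis mentions three but does not describe them here.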
Keywords/Search Tags: unsupervised word translation disambiguation, knowledge acquisition, web bilingual relatedness, web indirect association, language model