
Cross-Language Information Retrieval Based On Statistical Language Modeling

Posted on: 2010-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: S S Su
GTID: 2178360302460361
Subject: Computer application technology
Abstract/Summary:
With the growing diversity of Internet resources and of users' languages, cross-language information retrieval (CLIR) has attracted wide attention and has become one of the hot research topics in the information retrieval community. It allows users to retrieve documents in one language with queries in another, which can greatly benefit users who are not familiar with foreign languages.

The language-model-based retrieval paradigm brings IR technology into a promising yet challenging new world. Compared with traditional retrieval models, the language modeling approach not only has a solid theoretical foundation but also offers great flexibility. Under certain assumptions, classic retrieval models can be derived from the language modeling retrieval framework. In addition, a large number of experimental results show that this method is superior to other models, so it has drawn increasing attention from researchers. However, the method has mainly been used in monolingual retrieval tasks, and few researchers have studied its application in a cross-language setting. To address this gap, we first systematically present the principles and details of the language modeling approach for monolingual information retrieval, and then introduce two cross-language retrieval models, the statistical translation model and the cross-lingual relevance model, so as to extend the approach to the multilingual setting.

The out-of-vocabulary (OOV) problem and the translation ambiguity problem are the two most important issues that need to be addressed in order to improve cross-language retrieval performance. We analyze each problem in depth and then propose a corresponding solution strategy.

(1) Web-based OOV translation extraction. Most existing approaches to the OOV problem are based on statistical techniques and assume that the more often a candidate co-occurs with the OOV word, the more likely it is to be the correct translation of that word. However, all of these statistical methods rely heavily on the size of the corpus and usually lose effectiveness when no sufficiently large corpus is available. In view of this, we present a novel measure, called the frequency similarity metric, which is especially suitable for small corpora. Experimental results show that our method not only increases translation extraction accuracy but also improves cross-language retrieval effectiveness.

(2) Translation disambiguation. First, the ambiguity resolution problem is transformed into a ranking problem on a graph, which is built according to a bilingual translation dictionary. We then use a random walk algorithm, namely PageRank, to iteratively compute the weight of each candidate word in the graph. We assume that the higher the weight a word receives, the more likely it is to be the correct translation. When the algorithm converges to the stationary distribution, we choose the words with the highest weights as the final translations.
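
The graph-ranking idea in (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how edges are weighted, so the sketch assumes undirected edges between candidate translations with hypothetical association scores. The source terms (`term_a`, `term_b`, `term_c`), the candidate words, the scores, and the function names `build_graph`, `pagerank`, and `disambiguate` are all invented for illustration.

```python
def build_graph(candidates, association):
    """Build an undirected weighted graph over all candidate translations.

    candidates:  source term -> list of candidate translations (from a dictionary)
    association: (word_a, word_b) -> hypothetical association score (an assumption)
    """
    graph = {c: {} for cands in candidates.values() for c in cands}
    for (a, b), weight in association.items():
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def pagerank(graph, damping=0.85, tol=1e-8, max_iter=200):
    """Weighted PageRank computed by power iteration."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                continue
            for v, weight in graph[u].items():
                new_rank[v] += damping * rank[u] * weight / out_weight[u]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:  # converged to the stationary distribution
            break
    return rank


def disambiguate(candidates, rank):
    """Pick, for each source term, the candidate with the highest weight."""
    return {src: max(cands, key=rank.get) for src, cands in candidates.items()}


# Toy data, entirely hypothetical.
candidates = {
    "term_a": ["bank", "shore"],
    "term_b": ["interest", "hobby"],
    "term_c": ["loan"],
}
association = {
    ("bank", "interest"): 5.0,
    ("bank", "loan"): 6.0,
    ("interest", "loan"): 4.0,
    ("shore", "loan"): 0.5,
    ("hobby", "bank"): 0.5,
    ("shore", "hobby"): 0.5,
}

graph = build_graph(candidates, association)
ranks = pagerank(graph)
best = disambiguate(candidates, ranks)
print(best)
```

In this toy graph the mutually reinforcing candidates (`bank`, `interest`, `loan`) accumulate most of the rank, so they are selected over `shore` and `hobby`. Other edge-weighting schemes would fit the same framework.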
Keywords/Search Tags: Cross-Language Information Retrieval, Language Modeling, OOV, Translation Extraction, Translation Disambiguation