Cross-Language Information Retrieval Based On Statistical Language Modeling

Posted on: 2010-03-20

Degree: Master

Type: Thesis

Country: China

Candidate: S S Su

Full Text: PDF

GTID: 2178360302460361

Subject: Computer application technology

Abstract/Summary:

With the growing diversity of Internet resources and of the languages that users speak, cross-language information retrieval (CLIR) has attracted wide attention and has become one of the active research topics in the information retrieval community. It allows users to retrieve documents in one language with queries in another, which greatly benefits users who are not familiar with foreign languages.

The language-model-based retrieval paradigm has opened a promising yet challenging direction for IR technology. Compared with traditional retrieval models, the language modeling approach has both a solid theoretical foundation and great flexibility. Under certain assumptions, classic retrieval models can be derived from the language modeling retrieval framework. In addition, a large number of experimental results show that this approach is superior to other models, and it has therefore drawn increasing attention from researchers. However, it has mainly been applied to monolingual retrieval tasks, and few researchers have studied its application in a cross-language setting. To address this gap, we first systematically present the principles and details of the language modeling approach for monolingual information retrieval, and then introduce two cross-language retrieval models, the statistical translation model and the cross-lingual relevance model, in order to extend the approach to a multilingual environment.

The out-of-vocabulary (OOV) problem and the translation ambiguity problem are the two most important issues that must be addressed to improve cross-language retrieval performance. We analyze both problems in depth and then propose a corresponding solution for each.

(1) Web-based OOV translation extraction. Most existing approaches to the OOV problem are based on statistical techniques and assume that the more often a candidate co-occurs with the OOV word, the more likely it is to be the correct translation of that word. However, these statistical methods rely heavily on corpus size and usually lose effectiveness when a sufficiently large corpus is not available. In view of this, we present a novel measure, called the frequency similarity metric, which is especially suitable for small corpora. Experimental results show that our method improves both translation extraction accuracy and cross-language retrieval effectiveness.

(2) Translation disambiguation. First, the ambiguity resolution problem is transformed into a ranking problem on a graph built from a bilingual translation dictionary. We then use a random walk algorithm, namely PageRank, to iteratively compute the weight of each candidate word in the graph. We assume that the higher the weight a word receives, the more likely it is to be the correct translation. When the algorithm converges to a stable distribution, we choose the words with the highest weights as the final translations.
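
The graph-ranking idea behind translation disambiguation can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not say how edges are weighted, so the sketch assumes a hypothetical `association` table of non-negative scores between candidate translations of different query terms (for example, target-language co-occurrence counts). The function names, the damping factor of 0.85, and the German example data are likewise illustrative assumptions.

```python
from itertools import combinations


def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a weighted graph.

    weights maps node -> {neighbor: weight}; every node must be a key.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without edges is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if out_sum[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                continue
            for v, w in weights[u].items():
                new[v] += damping * rank[u] * w / out_sum[u]
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:  # converged to a stable distribution
            break
    return rank


def disambiguate(candidates, association):
    """Pick the highest-ranked translation for each source term.

    candidates: source term -> list of dictionary translations.
    association: frozenset({a, b}) -> score (hypothetical input, see lead-in).
    """
    graph = {c: {} for cands in candidates.values() for c in cands}
    # Link only candidates that translate *different* source terms.
    for (_, left), (_, right) in combinations(candidates.items(), 2):
        for a in left:
            for b in right:
                w = association.get(frozenset((a, b)), 0.0)
                if w > 0:
                    graph[a][b] = w
                    graph[b][a] = w
    rank = pagerank(graph)
    return {s: max(cands, key=rank.__getitem__)
            for s, cands in candidates.items()}


# Made-up English-to-German example; the scores are invented for illustration.
candidates = {
    "bank": ["Bank", "Ufer"],
    "interest": ["Zins", "Interesse"],
    "rate": ["Satz", "Tempo"],
}
scores = {
    ("Bank", "Zins"): 5.0, ("Bank", "Interesse"): 0.2,
    ("Ufer", "Zins"): 0.2, ("Ufer", "Interesse"): 0.1,
    ("Bank", "Satz"): 3.0, ("Bank", "Tempo"): 0.2,
    ("Ufer", "Satz"): 0.1, ("Ufer", "Tempo"): 0.5,
    ("Zins", "Satz"): 8.0, ("Zins", "Tempo"): 0.2,
    ("Interesse", "Satz"): 0.1, ("Interesse", "Tempo"): 0.3,
}
association = {frozenset(pair): w for pair, w in scores.items()}
chosen = disambiguate(candidates, association)
print(chosen)
```

In this toy setup the financial senses reinforce one another through heavily weighted edges, so they accumulate more rank than the competing senses. Candidates of the same source term are deliberately left unlinked so that alternatives compete instead of endorsing each other; whether the thesis makes the same choice is not stated in the abstract.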

Keywords/Search Tags:

Cross-Language Information Retrieval, Language Modeling, OOV, Translation Extraction, Translation Disambiguation

Related items

1	Research On Techniques Of Query Translation For Cross-language Information Retrieval
2	Query Translation Based On Visual Information For Cross Language Retrieval
3	Design And Implementation On Large-scale Patent Literatures Translation And Cross-language Retrieval System Based On Hadoop
4	Study On Web-based Translation Technology For Out-of-Vocabulary
5	Study And Implementation Of Cross-language Information Retrieval Technology
6	Research Of Some Key Issues In Highly Adaptive Example-Based Machine Translation
7	Building Comparable Corpora Based On Cross-language Text Similarity Metrics
8	The construction, use, and evaluation of a lexical knowledge base for English-Chinese cross-language information retrieval
9	Realization Of Design And Evaluation Of System For Speech Translation Lexicon
10	The Application Of Cross-Language Information Retrieval Based On Latent Semantic Analysis