Research On Cross Language Information Retrieval Based On Comparable Corpora

Posted on: 2016-11-10
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Zhu
GTID: 2308330464972624
Subject: Computer application technology
Abstract/Summary:
Cross-language information retrieval (CLIR) is a search method in which users enter queries in one language and retrieve information written in another. Research on CLIR aims to solve the retrieval problems caused by language differences, thereby increasing resource utilization and making it easier to search for information on the Internet. CLIR is an advanced and active field of study within information retrieval. This thesis focuses on cross-language information retrieval based on comparable corpora. The main work and contributions of this study can be summarized as follows:

(1) We propose an improved approach to extracting a bilingual lexicon from a comparable corpus, in order to improve the quality of the extracted lexicon. Only a few studies have used alignment information in bilingual lexicon extraction from comparable corpora, and they require the comparable corpora to be divided into 1-1 aligned document pairs. These studies have not been able to show that the extracted lexicons benefit from the embedded alignment information, because of the high cost of turning the comparable corpus into a parallel corpus. Moreover, strict 1-1 alignments are not common in comparable corpora, and by keeping only the aligned pairs these approaches greatly reduce the size of the usable corpus and suffer substantial information loss. In our approach, we combine the classic lexical context with pseudo-alignment information when extracting the bilingual lexicon. We compute the co-occurrence between words from the similarity between the documents in which they appear, subject to a threshold. We then combine the classic similarity based on lexical context with the similarity based on pseudo-alignment information into a new similarity measure, and use it to judge whether two words are translations of each other.
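
The combination of the two similarities described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method as implemented in the thesis: the abstract does not say how the two scores are merged, so the linear weight `lam`, the cosine measure over context vectors, the pair-share normalisation, and all function names are hypothetical. Context vectors are assumed to be already mapped into a common vocabulary (for example through a seed dictionary).

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_alignment_similarity(src_docs, tgt_docs, doc_sim, threshold):
    """Share of (source doc, target doc) pairs treated as pseudo-aligned.

    src_docs / tgt_docs: ids of the documents that contain each word.
    doc_sim: dict mapping (src_id, tgt_id) to a document similarity.
    A pair counts only if its similarity reaches the threshold.
    The normalisation by the number of pairs is an assumption.
    """
    total = len(src_docs) * len(tgt_docs)
    if total == 0:
        return 0.0
    aligned = sum(
        1
        for s in src_docs
        for t in tgt_docs
        if doc_sim.get((s, t), 0.0) >= threshold
    )
    return aligned / total


def combined_similarity(ctx_src, ctx_tgt, src_docs, tgt_docs, doc_sim,
                        threshold=0.5, lam=0.5):
    """Assumed linear mix of context and pseudo-alignment similarity."""
    context = cosine(ctx_src, ctx_tgt)
    alignment = pseudo_alignment_similarity(src_docs, tgt_docs, doc_sim, threshold)
    return lam * context + (1.0 - lam) * alignment
```

In such a sketch, candidate translations for a source word would be ranked by `combined_similarity`, so that a target word scores highly when it shares lexical context with the source word and also tends to occur in documents similar to those containing the source word.
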
Experiments on the English-French comparable corpus demonstrate that pseudo-alignment in comparable corpora is an essential feature that leads to a significant improvement over the standard method of lexicon extraction.

(2) We improve the performance of the cross-language information retrieval model by combining the extracted bilingual lexicon with the log-logistic model, an information-based retrieval model that previous research has shown to perform better than other models. Experiments show that this combination significantly improves the performance of the retrieval system.
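
As a rough illustration of contribution (2), the sketch below translates a query with an extracted lexicon and then scores a document with a log-logistic weighting. The abstract gives neither the scoring formula nor the way the lexicon is plugged into the model, so everything here is an assumption: the score follows the form in which the log-logistic information-based model is usually stated (length-normalised term frequency compared with the share of documents containing the term), the parameter `c` and the `top_k` translation cut-off are hypothetical, and the function names are invented for this example.

```python
from math import log


def translate_query(query_terms, lexicon, top_k=1):
    """Replace each source term by its top_k candidate translations.

    lexicon: dict mapping a source word to a list of
    (target word, score) pairs, best candidate first.
    Terms missing from the lexicon are dropped in this sketch.
    """
    translated = []
    for term in query_terms:
        for target, _score in lexicon.get(term, [])[:top_k]:
            translated.append(target)
    return translated


def log_logistic_score(query_terms, doc_tf, doc_len, avg_len,
                       doc_freq, n_docs, c=1.0):
    """Log-logistic style retrieval score of one document (assumed form)."""
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue  # term absent from the document or the collection
        t = tf * log(1.0 + c * avg_len / doc_len)  # length-normalised frequency
        lam = df / n_docs                          # share of docs containing the term
        score += log((lam + t) / lam)
    return score
```

A cross-language run would then call `translate_query` on the source-language query and rank target-language documents by `log_logistic_score` of the translated terms.
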
Keywords/Search Tags: Comparable corpus, Lexicon extraction, Cross-language information retrieval