
Cross Language Information Retrieval Based On Topical Pseudo Relevance Feedback

Posted on: 2015-08-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X W Wang
GTID: 1228330467963667
Subject: Computer Science and Technology
Abstract/Summary:
Information retrieval has brought great convenience to network users. The large volume of multilingual web pages has also led to a growing demand for cross language information retrieval (CLIR). In CLIR tasks, queries and retrieval results are in different languages, and the word mismatch problem is particularly serious. Because translation quality is limited, combining existing monolingual information retrieval techniques with CLIR is an important and practical way to improve cross-lingual retrieval performance.

Relevance feedback is one of the key technologies for improving the relevance of retrieval results and increasing users' satisfaction. It has achieved great success in monolingual information retrieval. Increasing attention has been paid to the design of effective multilingual relevance feedback mechanisms, which can exploit the latent correlation between different languages and provide valuable information for CLIR tasks.

In this thesis, we explored different multilingual pseudo relevance feedback (PRF) methods for cross language information retrieval. The key problem was to choose useful multilingual terms for expanding the source language queries and their translations; cross language retrieval results were then improved using these optimized queries. Existing cross-lingual relevance feedback was performed at the document level, with high-frequency words in pseudo relevant documents adopted as expansion terms. This thesis instead performed cross-lingual pseudo relevance feedback at the fine-grained topic level. Several problems were studied, including the quality of feedback information, the latent correlation between multilingual content, and the usefulness of query expansion terms. A series of topical PRF models was put forward for CLIR, and the selection of multilingual expansion words was also discussed.
The main work of this dissertation can be summarized as follows:

A topic-based cross-lingual pseudo relevance feedback model was proposed for CLIR. Existing PRF strategies are mainly based on retrieved documents, which are rich in content and may contain noisy information. The topic-based PRF model was applied to the source language retrieval documents before the query translation stage, and also to the target language retrieval documents after query translation. According to the topic distribution, topics that assign high probabilities to query terms were considered semantically relevant information. Under the same experimental conditions, the topic-based cross-lingual PRF model was shown to be more robust than the document-based PRF models. Compared with cross language retrieval without any PRF mechanism, CLIR using the topic-based PRF model increased nDCG by 1.3 percent, while CLIR using the document-based PRF model increased nDCG by 0.4 percent. The document-based PRF models may introduce too much noise into CLIR, and the topical feedback information is more reliable.

A cross-lingual pseudo relevance feedback model based on bilingual topics was proposed. Unlike monolingual retrieval tasks, CLIR yields bilingual retrieval results, which provide a new resource for improving PRF performance. Bilingual documents retrieved by the same query were treated as comparable content. The topic model was extended to a bilingual topic model that models source language documents and target language documents at the same time. Queries and their translations were then expanded on the basis of the common bilingual topics. Compared with existing methods that apply monolingual PRF stage by stage, the bilingual topic-based PRF model exploits the relevance between multilingual documents and achieves better performance on highly comparable bilingual documents.
Experiments showed that CLIR using the bilingual topic-based PRF model increased nDCG by more than 2.4 percent.

A cross-lingual pseudo relevance feedback model based on weak-relevant topic alignment was proposed for retrieved documents of poor comparability. If the bilingual retrieval results of a query are not comparable, it is difficult for the bilingual topic-based PRF model to obtain strongly correlated common bilingual topics. In the weak-relevant topic-based PRF model, topics in different languages were therefore aligned on the basis of translation. A multilingual similarity score based on translation and web co-occurrence features was applied to extract useful expansion terms from weak-relevant topics. Experimental results showed that CLIR using the weak-relevant topic-based PRF model increased nDCG by more than 6.4 percent. The PRF model based on weak-relevant topic alignment could effectively restrain the query drift problem, and was more suitable for CLIR on non-comparable or weakly comparable documents.

A Chinese-English cross language information retrieval system integrating the cross-lingual pseudo relevance feedback mechanisms, namely the CTP-CLIR system, was implemented. The CTP-CLIR system could access web pages to collect free corpora and retrieve from a local multilingual database automatically. All of the cross-lingual PRF models proposed in this thesis were integrated into this system, so that Chinese-English CLIR was implemented with high quality on the basis of multiple cross-lingual PRF mechanisms.
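
As a rough illustration of the topic-based feedback step summarized above (topics that give query terms high probability are treated as relevant, and expansion terms are drawn from them), the following is a minimal sketch, not the thesis's actual model. It assumes the topic-word distributions have already been estimated from the pseudo relevant documents; the names `select_relevant_topics`, `expansion_terms`, and `TOY_TOPICS`, the scoring by summed query-term probability, and the toy numbers are all hypothetical.

```python
from typing import Dict, List

# A topic is a mapping word -> P(word | topic). In practice these would come
# from a topic model fitted on the pseudo relevant documents; here they are
# hand-written toy values (an assumption for illustration only).
Topic = Dict[str, float]

TOY_TOPICS: List[Topic] = [
    {"retrieval": 0.30, "query": 0.25, "index": 0.20, "ranking": 0.15, "feedback": 0.10},
    {"football": 0.40, "match": 0.30, "goal": 0.20, "query": 0.10},
]


def select_relevant_topics(topics: List[Topic], query_terms: List[str],
                           top_k: int = 1) -> List[int]:
    """Return indices of the top_k topics ranked by the total probability
    they assign to the query terms (topics with zero mass are dropped)."""
    scores = [sum(t.get(q, 0.0) for q in query_terms) for t in topics]
    ranked = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0.0]


def expansion_terms(topics: List[Topic], query_terms: List[str],
                    top_k: int = 1, n_terms: int = 3) -> List[str]:
    """Pick the n_terms highest-probability non-query words from the
    selected topics as candidate query expansion terms."""
    weights: Dict[str, float] = {}
    for i in select_relevant_topics(topics, query_terms, top_k):
        for word, prob in topics[i].items():
            if word not in query_terms:
                weights[word] = weights.get(word, 0.0) + prob
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(weights, key=lambda w: (-weights[w], w))[:n_terms]


if __name__ == "__main__":
    print(expansion_terms(TOY_TOPICS, ["retrieval", "query"], top_k=1, n_terms=2))
```

In a cross-lingual setting, the same selection could be run once on source language topics before query translation and once on target language topics after it, as the first model in the abstract describes. The translation step itself is outside this sketch.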
Keywords/Search Tags: pseudo relevance feedback, cross language information retrieval, topic model, query expansion, multilingual relevance