
Mining Bilingual Resources In Encyclopedia And Its Application In Query Translation For CLIR

Posted on: 2015-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: Q W Yan
GTID: 2268330428972654
Subject: Computer application technology
Abstract/Summary:
In the information age, searching for information on the Internet has become an important part of daily life, and the demand for information in other languages keeps growing. Cross-language information retrieval (CLIR) lets users search for information in other languages using queries written in their native language. Query translation is the most frequently used technique in CLIR, and the translation of proper names, new words, idioms and technical terms is one of the key factors affecting the performance of CLIR systems.

To address the out-of-vocabulary (OOV) problem, we first mine translation templates and then use the learned templates to extract translation pairs from an encyclopedia, thereby expanding the dictionary. In the template mining module, the template of a Chinese-English translation pair is divided into five parts, and a PAT array is used to extract the longest common prefix (LCP) from which the template is formed. In the query translation module, a hybrid of statistical machine translation and example-based machine translation translates the search terms. Finally, Lucene is used to build inverted indexes over the Chinese and English texts, and a CLIR system based on query translation is built on top of them.

The two innovations of this thesis are as follows.

This thesis proposes a new method for automatically extracting high-quality translation pairs from Wikipedia, exploiting its broad coverage and its data structure. The method has three steps: (1) extract translation pairs from the language toolbox of Wikipedia, which serve as seed knowledge for the next step; (2) learn templates of translation pairs using the knowledge gained in the previous step; (3) automatically extract further translation pairs using the learned templates. Our experimental results show that the method learns not only common patterns but also many patterns that a human would be unlikely to find.
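The LCP-based template mining described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it sorts whole context strings rather than building a true PAT array over all suffixes, and the function name `mine_templates`, the `<EN>` placeholder, the thresholds and the sample contexts are all assumptions made for illustration.

```python
from os.path import commonprefix  # character-wise common prefix of strings


def mine_templates(contexts, min_len=3, min_support=2):
    """Return (template, support) pairs shared by several context strings.

    Simplified stand-in for a PAT array: sort the strings, take the longest
    common prefix (LCP) of each adjacent pair as a candidate template, then
    count how many contexts start with each candidate.
    """
    ordered = sorted(contexts)
    candidates = set()
    for left, right in zip(ordered, ordered[1:]):
        prefix = commonprefix([left, right])
        if len(prefix) >= min_len:
            candidates.add(prefix)
    support = {p: sum(s.startswith(p) for s in ordered) for p in candidates}
    ranked = [(p, n) for p, n in support.items() if n >= min_support]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical contexts: the text that follows a known Chinese term, with the
# known English translation replaced by the placeholder <EN>.
contexts = [
    "（英语：<EN>）是",
    "（英语：<EN>），又称",
    "（英文：<EN>）是",
    "，英文名<EN>，",
]
templates = mine_templates(contexts)
```

In this toy input, the only prefix that is long enough and shared by at least two contexts is the pattern with the `<EN>` slot between "（英语：" and "）", which is the kind of template that could then be matched against new text to extract further pairs.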
Before verification, the accuracy of the extracted translation pairs reaches 76.63%; after verification it rises to 90.4%.

To improve the accuracy of phrase translation, we adopt a hybrid of example-based machine translation and statistical machine translation. The statistical component takes morphology into account, and the search for the best translation is recast as a re-ranking problem. The results show that the hybrid system achieves better performance.
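Treating translation selection as re-ranking can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the features or weights used, so the `Candidate` class, the feature names (`tm`, `lm`, `example_match`), the weights and the scores below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    text: str                      # candidate translation
    source: str                    # "smt" or "ebmt" (hypothetical labels)
    features: dict = field(default_factory=dict)  # feature name -> score


def rerank(candidates, weights):
    """Sort candidates by a weighted linear combination of their features."""
    def score(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate.features.items())
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates pooled from both systems for one query phrase.
candidates = [
    Candidate("information retrieve", "smt",
              {"tm": -1.2, "lm": -3.0, "example_match": 0.0}),
    Candidate("information retrieval", "ebmt",
              {"tm": -1.5, "lm": -1.0, "example_match": 1.0}),
]
weights = {"tm": 1.0, "lm": 1.0, "example_match": 0.5}  # assumed values
best = rerank(candidates, weights)[0]
```

Pooling the outputs of both systems and scoring them with one function is one simple way to let an example-based match outrank a statistical candidate, or the reverse, depending on the evidence for each.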
Keywords/Search Tags: CLIR, Wikipedia, Phrase Translation, Inverted Index