Mining Bilingual Resources In Encyclopedia And Its Application In Query Translation For CLIR

Posted on:2015-03-24

Degree:Master

Type:Thesis

Country:China

Candidate:Q W Yan

Full Text:PDF

GTID:2268330428972654

Subject:Computer application technology

Abstract/Summary:

In the information age, retrieving information on the Internet has become an important part of people's daily lives, and the demand for information in other languages keeps growing. Cross-language information retrieval (CLIR) lets users search for information in other languages using their native language. Query translation is the most frequently used technique for CLIR, and the translation of proper names, new words, idioms, and technical terminology is one of the key factors affecting the performance of CLIR systems.

To address the out-of-vocabulary (OOV) problem, this thesis first mines translation templates and then uses the learned templates to extract translation pairs from an encyclopedia, thereby expanding the dictionary. In the template mining module, the template of a Chinese-English translation pair is separated into five parts, and a Pat-Array is used to extract the LCP (Longest Common Prefix) that forms the template. In the query translation module, a hybrid of statistical machine translation and example-based machine translation translates the search terms. Finally, Lucene is used to build inverted indexes over the Chinese and English texts, and a CLIR system based on query translation is built on top of them.

The two innovations of this thesis are as follows.

First, it proposes a new method for automatically extracting high-quality translation pairs from Wikipedia, based on Wikipedia's broad coverage and data structure. The method has three steps: (1) extract translation pairs from the language toolbox of Wikipedia, which serve as a starting point for the next step; (2) learn translation-pair templates using the knowledge gained in the previous step; (3) automatically extract further translation pairs using the learned patterns. The experimental results show that the method learns not only common patterns but also many patterns that humans would be unlikely to find. Accuracy reaches 76.63% before verification and rises to 90.4% after verification.

Second, to improve the accuracy of phrase translation, it adopts a hybrid of example-based and statistical machine translation. Morphology is taken into account in the statistical machine translation, and the search for the best translation is recast as a re-ranking problem. The results show that the hybrid system achieves better performance.

Keywords/Search Tags:

CLIR, Wikipedia, Phrase Translation, Inverted Index

Related items

1	Research On Chinese Complex Noun Phrase Translation Extraction Based On Multi-strategy
2	Research And Applications On Phrase-Oriented Neural Machine Translation
3	Discontinuous Phrase Template Extraction And Phrase Combination In Phrase-Based Statistical Machine Translation
4	Research On Translation Methods Of Query Items In Chinese-Mongolian Cross-Language Information Retrieval
5	Research And Implementation Of Hierarchical Phrase-Based Translation Model In Statistical Machine Translation
6	Research And Implementation Of Hierarchical Phrase-based Translation Model In Statistical Machine Translation
7	Research On Term Automatic Translation Technology Based On NP Tree For English Patent Documentation
8	Translation Knowledge Acquisition In Corpus-based Machine Translation
9	Research On Phrase-based Statistical Machine Translation
10	Research On Translation Rule Constraint Problems In Hierarchical Phrase Based Translation Model