Study On Web-based Translation Technology For Out-of-Vocabulary

Posted on: 2012-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: C L Sun
Full Text: PDF
GTID: 2218330368492438
Subject: Computer application technology
Abstract/Summary:
Translating out-of-vocabulary (OOV) query terms is one of the key factors affecting cross-language information retrieval (CLIR). The quality of OOV translation directly affects the performance of natural language processing tasks such as machine translation and cross-language retrieval. This thesis mines OOV translations from the rich resources of the Web, combining the respective strengths of Wikipedia and search engines. The specific steps are as follows. Abbreviations are one kind of OOV term. Because abbreviations are ambiguous, a single abbreviation often has several full forms, so it is essential to recognize abbreviations and extract their full forms. This thesis extracts the full forms of abbreviations using search engines and Wikipedia. The alignment between query terms and Wikipedia articles is then divided roughly into two types: target-link alignment and non-target-link alignment. For an entry with target-link alignment, the OOV translation is obtained by extracting the title of the target-language link; for an entry without target-link alignment, the translation is mined through a search engine. First, cross-language query expansion is performed to obtain high-quality bilingual summary resources: for a source-language entry that exists in Wikipedia, the target-language titles of the hyperlinks in the corresponding summary section are extracted as cross-language expansion terms.
Second, when the target link is missing, the bilingual co-occurrence context returned by the search engine is used. A query expansion method based on keyword translation from this co-occurrence information is applied: the expansion terms are translated with the help of an auxiliary dictionary, and a new query that fuses the OOV term with the expansion terms is resubmitted to the search engine. Third, a reduction hierarchical clustering algorithm driven by log-likelihood ratio values is used to extract candidate multi-word units, and it is compared with commonly used statistical methods. Finally, a frequency-distance model, a surface pattern matching model, and a transliteration model are combined to select the best candidate translations of the query from the translation units. The experimental results show that the correct translation rate within the Top 10 candidates reaches 93.8%.
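The log-likelihood ratio used to score candidate multi-word units can be illustrated with a minimal sketch. It assumes the standard 2x2 contingency-table form of the statistic over adjacent word pairs; the abstract does not say which formulation or clustering procedure the thesis uses, so the function names `llr` and `score_bigrams` are hypothetical and the hierarchical merging step is left out.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2) together
    k12: count of w1 followed by something other than w2
    k21: count of w2 preceded by something other than w1
    k22: count of all remaining pairs
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def score_bigrams(tokens):
    """Return {(w1, w2): LLR score} for every adjacent pair in tokens."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)   # w1 in first position
    right_counts = Counter(b for _, b in pairs)  # w2 in second position
    n = len(pairs)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = left_counts[a] - k11
        k21 = right_counts[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


if __name__ == "__main__":
    # Hypothetical token stream standing in for search-engine snippets.
    tokens = "new york is big new york is old".split()
    for pair, score in sorted(score_bigrams(tokens).items(),
                              key=lambda item: -item[1]):
        print(pair, round(score, 3))
```

Pairs that co-occur more often than independence would predict receive higher scores, so they are the natural candidates to merge first when building multi-word units.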
Keywords/Search Tags: Cross-Language Information Retrieval, Query Translation, OOV, Search Engine, Wikipedia
Written by Sun Changlong