Word Pair Extraction And Web-based Mining Of OOV Translations

Posted on: 2010-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: J Sun
GTID: 2178360275459235
Subject: Computer application technology
Abstract/Summary:
Out-of-vocabulary (OOV) identification has long been a difficult problem in Chinese information processing, and translating OOVs is an important task in natural language processing. In applications such as Cross-Language Information Retrieval (CLIR) and Question Answering (QA), the correctness of OOV translations directly affects final performance.

This thesis introduces three kinds of methods for extracting word pairs: unsupervised, supervised, and semi-supervised, distinguished by whether the corpus has been annotated. We score the extracted word pairs with 12 frequency-based metrics and 2 context-similarity-based metrics. Experimental results show that the best approach is to extract word pairs with the semi-supervised method and score them with a simple frequency metric.

Web-based mining of OOV translations is the focus of this thesis. First, OOVs are classified into literal words and non-literal words, and their direct English expansions or co-occurrence English expansions are acquired. The OOVs and their English expansions are then submitted to a search engine, and the translations of the OOVs are mined from the returned search pages. The returned pages are preprocessed to filter out interfering information before candidate translations are extracted. The candidates are then ranked according to features such as frequency and distance. We also use Forward Maximal Matching Weighting and Word Alignment techniques to move the correct translations toward the top of the ranking. Experiments show that our method for mining OOV translations is feasible and efficient: the Top 1 Coverage of mined translations is more than 80 percent, while the Top 5 Coverage approaches or reaches 100 percent.
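
The ranking step described above, ordering candidates by frequency and distance, can be illustrated with a minimal sketch. The abstract does not give the thesis's actual scoring formula, so everything here is an assumption made for illustration: the function name `rank_candidates`, the regular expression used to pick out English candidates, the character-based distance, the `window` cutoff, and the `1 / (1 + gap)` weighting are all hypothetical. The sample snippets are invented and are not data from the thesis.

```python
import re
from collections import defaultdict

# Runs of English words separated by single spaces (an assumed candidate pattern).
ENGLISH_RUN = re.compile(r"[A-Za-z][A-Za-z'\-]*(?: [A-Za-z][A-Za-z'\-]*)*")


def rank_candidates(oov, snippets, window=30):
    """Rank English candidates found near `oov` in search-result snippets.

    Hypothetical scoring, not the thesis's formula: each occurrence of a
    candidate adds 1 / (1 + gap), where gap is the character distance to
    the nearest occurrence of the OOV term. Frequent and nearby candidates
    therefore score higher.
    """
    scores = defaultdict(float)
    for text in snippets:
        positions = [m.start() for m in re.finditer(re.escape(oov), text)]
        if not positions:
            continue  # the snippet does not mention the OOV term
        for m in ENGLISH_RUN.finditer(text):
            # Gap between the candidate span and the nearest OOV span.
            gap = min(
                max(m.start() - (p + len(oov)), p - m.end(), 0)
                for p in positions
            )
            if gap <= window:
                scores[m.group().lower()] += 1.0 / (1.0 + gap)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented example snippets for the OOV term "禽流感".
example_snippets = [
    "禽流感(bird flu)是一种传染病",
    "禽流感 bird flu 的症状",
    "关于 avian influenza 的报道,禽流感",
]
for candidate, score in rank_candidates("禽流感", example_snippets):
    print(f"{candidate}\t{score:.3f}")
```

In this toy setup a candidate that appears often and directly beside the OOV term outranks one that appears once and further away. A real system would presumably combine such scores with the other signals the abstract mentions, such as Forward Maximal Matching Weighting and Word Alignment.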
Keywords/Search Tags: Data Mining, Machine Translation, Word Pair Extraction, OOV Translation, Keyword Expansion, Web