The Research Of Bilingual Corpus Mining Based On Wikipedia

Posted on: 2011-10-06, Degree: Master, Type: Thesis
Country: China, Candidate: G G Meng, Full Text: PDF
GTID: 2178330332466305, Subject: Computer technology
Abstract/Summary:
Large-scale parallel or comparable corpora are essential resources for building high-performance statistical machine translation systems. Building such corpora remains difficult, and existing bilingual corpora still cannot meet the needs of processing real text because of their small scale, poor timeliness, and unbalanced domain coverage.

This thesis studies a method for automatically obtaining large-scale bilingual parallel or comparable corpora from Wikipedia and verifies its validity.

Obtaining the web resources of interest from a particular website requires heuristic information. In Wikipedia, parallel or comparable bilingual resources exist in pairs of parallel or comparable monolingual web pages. In this thesis, we define effective heuristic information for these two kinds of resources in order to obtain more bilingual data.

Website authors usually follow certain rules when naming parallel or comparable bilingual web pages. We present an algorithm based on URL naming templates to find more candidate web pages in Wikipedia automatically.

Web pages may contain non-translational content and out-of-vocabulary words, both of which reduce sentence alignment accuracy and make alignment more difficult. The similarity of the HTML tag structures of parallel or comparable web documents may help improve sentence alignment performance on web data. Because web pages are noisy, this thesis uses HTML tag structure similarity together with a DOM alignment algorithm to extract parallel or comparable sentences.

Finally, we build an experimental platform to mine parallel or comparable bilingual corpora from Wikipedia automatically.
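
As an illustration of the HTML tag structure similarity mentioned above, the following is a minimal sketch and not the method used in the thesis. It assumes that each page can be reduced to its sequence of opening and closing tags and that two sequences can be compared with a generic sequence matcher; the names `TagSequenceParser`, `tag_sequence`, and `structure_similarity` are hypothetical, and the abstract does not specify the actual similarity measure or the DOM alignment algorithm.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening and closing tag names in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Record only the tag name; attributes and text are ignored.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix closing tags so that nesting is reflected in the sequence.
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the list of tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def structure_similarity(html_a, html_b):
    """Return a score in [0, 1] comparing the tag sequences of two pages.

    Assumption: difflib's ratio stands in for the similarity measure,
    which the abstract does not define.
    """
    seq_a = tag_sequence(html_a)
    seq_b = tag_sequence(html_b)
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
```

Under this sketch, two pages whose markup is identical but whose text is in different languages receive the maximum score, while pages with different layouts score lower. A candidate page pair could then be kept or discarded by comparing the score against a threshold chosen empirically.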
Keywords/Search Tags: Wikipedia, Bilingual Web, Web Mining, Bilingual Sentences, Statistical Machine Translation