
Research On Large-Scale Bilingual Parallel Corpus Extraction From The Web

Posted on: 2013-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Feng
GTID: 2248330371993546
Subject: Computer software and theory
Abstract/Summary:
Large-scale bilingual parallel corpora benefit many Natural Language Processing (NLP) applications, such as machine translation and cross-language information retrieval. The Web holds massive multilingual text resources, and most previous research has focused on extracting bilingual parallel resources from pairs of parallel monolingual pages. Although considerable manpower, material and funding have been spent on extracting such resources, the corpora collected so far are far from sufficient for real text processing because of their small scale, poor timeliness and imbalance across domains. Recently, researchers have found that parallel bilingual resources exist not only in parallel monolingual page pairs but also within single bilingual pages, and that bilingual pages contain more parallel resources, with higher translation quality and broader domain coverage. This thesis focuses solely on such bilingual pages and proposes to obtain a large-scale bilingual parallel corpus automatically from the Web. Our results can be summarized as follows:

(1) Discovering bilingual pages from the Web. The Web contains a massive number of pages, so discovering bilingual pages accurately is a major challenge. Previous work usually adopts target-driven methods: first collect a large number of source Web sites (such as English-learning sites and translation sites), then download all of their internal pages as candidate bilingual pages. However, collecting source sites requires human intervention and yields only a limited number of candidates. To overcome these disadvantages, other work proposes discovering source sites automatically using search engines and heuristic information, but such methods return many noisy pages, and the parallel resources mined from them are of poor quality. This thesis is the first to propose discovering and extracting bilingual pages using search engines together with a small-scale parallel corpus that has already been acquired. Experimental results show that this method acquires high-quality bilingual pages quickly, accurately and continuously.

(2) Improving bilingual parallel sentence extraction and alignment methods. Bilingual pages contain not only valuable bilingual parallel resources but also noise, such as advertisements and navigation elements. In addition, these resources appear in varied forms and contain many out-of-vocabulary words. All of this makes acquiring parallel resources difficult. This thesis proposes methods that mine parallel resources by automatically learning the forms in which they appear, and that improve the translation quality of the extracted parallel corpus using sentence length, a bilingual dictionary and translation models.
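
To make the length-based and dictionary-based quality checks mentioned in the abstract more concrete, the following is a minimal sketch of how candidate sentence pairs might be filtered. It is not the thesis's actual implementation: the function names, the thresholds (`low`, `high`, `min_coverage`), the whitespace tokenization, and the toy English-Spanish lexicon are all assumptions chosen for illustration, and the translation-model scoring step is left out entirely.

```python
from typing import Dict, List, Set, Tuple


def length_ratio_ok(src_tokens: List[str], tgt_tokens: List[str],
                    low: float = 0.5, high: float = 2.0) -> bool:
    """Return True if the source/target token-count ratio lies in [low, high].

    The bounds are illustrative assumptions, not values from the thesis.
    """
    if not src_tokens or not tgt_tokens:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return low <= ratio <= high


def dictionary_coverage(src_tokens: List[str], tgt_tokens: List[str],
                        lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with at least one lexicon translation
    that appears in the target sentence."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_set)
    return covered / len(src_tokens)


def filter_pairs(pairs: List[Tuple[str, str]],
                 lexicon: Dict[str, Set[str]],
                 min_coverage: float = 0.5) -> List[Tuple[str, str]]:
    """Keep only pairs that pass both the length check and the lexicon check."""
    kept = []
    for src, tgt in pairs:
        # Whitespace tokenization is a simplification for this sketch.
        s = src.lower().split()
        t = tgt.lower().split()
        if length_ratio_ok(s, t) and dictionary_coverage(s, t, lexicon) >= min_coverage:
            kept.append((src, tgt))
    return kept


# Hypothetical toy data, used only to exercise the functions above.
TOY_LEXICON = {
    "the": {"el", "la"},
    "cat": {"gato"},
    "sleeps": {"duerme"},
    "house": {"casa"},
}
CANDIDATES = [
    ("the cat sleeps", "el gato duerme"),                    # plausible translation
    ("the house", "haga clic aquí para ver más ofertas"),    # advertisement-like noise
    ("the cat sleeps", "la casa"),                           # similar length, wrong content
]

if __name__ == "__main__":
    for pair in filter_pairs(CANDIDATES, TOY_LEXICON):
        print(pair)
```

In this sketch the length check discards pairs whose sizes differ too much (such as page noise paired with a short sentence), while the lexicon check discards pairs of similar length that share little translated vocabulary. A real pipeline would likely need language-appropriate tokenization and tuned thresholds.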
Keywords/Search Tags: Web information mining, bilingual parallel corpus, parallel resources alignment, bilingual Web page acquisition, machine learning