Large-scale Bilingual Parallel Corpus Collection System Based On Hadoop

Posted on: 2014-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: Z C Zhang
Full Text: PDF
GTID: 2268330422450597
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, statistical methods have come to dominate natural language processing, and example-based and statistical translation methods have opened a new research approach for machine translation. In machine translation, a corpus is an indispensable basis for statistical machine learning. In particular, a bilingual corpus provides basic resources for areas of natural language processing such as machine translation and cross-language retrieval, and it can significantly improve the quality of machine translation. Meanwhile, acquiring translation knowledge from a corpus enables finer-grained mining of translation lexicons and templates.

With the rapid development of the Internet and the accompanying information explosion, the massive body of online web resources contains plenty of bilingual translation resources. Compared with other resources, bilingual translation resources from the Internet are more timely, cover broader areas, and are larger in volume.

Research on web-based technology for acquiring large-scale bilingual parallel corpora is therefore important for solving the problem of access to bilingual corpora, for promoting the development of related technologies, and for making them more practical. The goal of this thesis is to build an Internet-facing bilingual corpus acquisition system based on the Hadoop distributed computing platform.

We first introduce Hadoop's distributed technology, namely the MapReduce computation model and the HDFS distributed file system, and analyze the key technologies in a crawler, such as the task scheduling algorithm, duplicate removal, and page updating and identification. We discuss the performance and efficiency bottlenecks of the crawler and, on that basis, design a Hadoop-based, web-facing, large-scale, multi-language web crawler and an incremental crawler. We then introduce a method for acquiring bilingual parallel sentences. Finally, we estimate the scale of bilingual translation resources on the Internet by sampling, and then demonstrate its validity.

The significance of this thesis lies in proposing a method for collecting large-scale bilingual corpus resources from the Web and in implementing the bilingual corpus collection system on the Hadoop distributed computing framework. The system can effectively collect webpages from the Internet, detect bilingual sites, and crawl incrementally, so as to build a large-scale bilingual parallel corpus that supports machine translation research. In addition, we study the update patterns of bilingual websites and, through designed experiments, estimate the volume of bilingual corpus resources on the web, which offers guidance for related research on bilingual corpus acquisition.
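
The abstract names duplicate removal as one of the crawler's key technologies and MapReduce as the computation model, but does not describe the implementation. As a rough illustration only, and not the thesis's actual method, the sketch below simulates in a single Python process how URL de-duplication can be phrased as a map step (emit a normalized URL as the key), a shuffle step (group by key), and a reduce step (keep one record per key). The function names, the normalization rules, and the choice to keep the earliest fetch time are all assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal.

    Assumed rules for this sketch: lowercase scheme and host, use "/" for an
    empty path, and drop the fragment. A real crawler would likely need more.
    """
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def map_phase(records):
    """Map: emit (normalized_url, fetched_at) for every crawled record."""
    for url, fetched_at in records:
        yield normalize(url), fetched_at


def shuffle(pairs):
    """Shuffle: group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: keep one record per URL (here, the earliest fetch time)."""
    for key, values in groups.items():
        yield key, min(values)


def dedupe(records):
    """Run the three phases in order and return {url: earliest_fetch_time}."""
    return dict(reduce_phase(shuffle(map_phase(records))))


if __name__ == "__main__":
    crawled = [
        ("HTTP://Example.com", 3),
        ("http://example.com/", 1),
        ("http://example.com/a#top", 2),
    ]
    for url, first_seen in sorted(dedupe(crawled).items()):
        print(url, first_seen)
```

On a real Hadoop cluster the grouping in `shuffle` is performed by the framework between the map and reduce tasks, so only the map and reduce functions would be written by hand.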
Keywords/Search Tags: Hadoop, web crawler, incremental crawler, page updating, bilingual corpus collection