Large-scale Bilingual Parallel Corpus Collection System Based On Hadoop

Posted on:2014-09-17

Degree:Master

Type:Thesis

Country:China

Candidate:Z C Zhang

GTID:2268330422450597

Subject:Computer Science and Technology

Abstract/Summary:

In recent years, statistical methods have come to dominate natural language processing, and example-based and statistical translation methods have opened a new line of research in machine translation.

In machine translation, a corpus is the indispensable basis of statistical machine learning. A bilingual corpus in particular is a basic resource for natural language processing tasks such as machine translation and cross-language retrieval, and it can significantly improve machine translation quality. Translation knowledge acquired from the corpus also allows finer-grained mining of translation lexicons and templates.

With the rapid development of the Internet and the accompanying information explosion, online web resources now contain a large amount of bilingual translation material. Compared with other sources, bilingual translation resources from the Internet are more timely, cover broader areas, and are larger in volume.

Research on web-based acquisition of large-scale bilingual parallel corpora is therefore important for solving the problem of access to bilingual corpora, and for advancing related technologies and making them more practical. The goal of this thesis is to build an Internet-facing bilingual corpus acquisition system on the Hadoop distributed computing platform.

We first introduce Hadoop's distributed technology, namely the MapReduce computation model and the HDFS distributed file system, and analyze the key techniques in a crawler, such as the task scheduling algorithm, duplicate removal, and page updating and identification. We discuss the crawler's performance and efficiency bottlenecks and, on that basis, design a Hadoop-based, web-facing, large-scale, multi-language web crawler and an incremental crawler. We then introduce a method for acquiring bilingual parallel sentences. Finally, we estimate the scale of bilingual translation resources on the Internet by sampling, and demonstrate its validity.

The significance of this thesis lies in proposing a method for collecting large-scale bilingual corpus resources from the web and implementing the bilingual corpus collection system on the Hadoop distributed computing framework. The system can effectively collect web pages from the Internet, detect bilingual sites, and crawl incrementally, so as to build a large-scale bilingual parallel corpus that supports machine translation research. In addition, we study the update patterns of bilingual websites and, through designed experiments, estimate the volume of bilingual corpus resources on the web, which offers guidance for related research on bilingual corpus acquisition.
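The abstract names duplicate removal as one of the crawler's key techniques and MapReduce as the computation model, but it does not describe the algorithm actually used. The following is only a minimal sketch of how URL deduplication is commonly expressed as a map, shuffle, and reduce pipeline. It simulates the three phases in plain Python rather than calling any Hadoop API, and the function names (`normalize`, `map_phase`, `shuffle`, `reduce_phase`, `dedupe`) and the normalization rules are assumptions made for illustration, not taken from the thesis.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different spellings compare equal.

    Assumed rules (not from the thesis): lowercase scheme and host,
    use "/" for an empty path, and drop the fragment.
    """
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def map_phase(urls):
    """Emit (normalized_url, original_url) pairs, as a mapper would."""
    for url in urls:
        yield normalize(url), url


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Emit each key exactly once, which is what removes the duplicates."""
    for key in sorted(groups):
        yield key


def dedupe(urls):
    """Run the three simulated phases and return the unique URLs."""
    return list(reduce_phase(shuffle(map_phase(urls))))


if __name__ == "__main__":
    crawled = [
        "HTTP://Example.com",
        "http://example.com/",
        "http://example.com/#top",
        "http://example.com/zh/",
    ]
    for url in dedupe(crawled):
        print(url)
```

Keying on the normalized URL means all spellings of the same page reach the same reducer, so each reducer can emit its key once without knowing about the rest of the data. On a real cluster the grouping would be done by the framework's shuffle rather than by an in-memory dictionary.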

Keywords/Search Tags:

Hadoop, web crawler, incremental crawler, page updating, bilingual corpus collection

Related items

1	Research And Implementation Of Distributed Web Crawler Based On Hadoop
2	Research On Topic Focused Web Crawler And Related Technologies
3	Research Of Internet Information Collection System Based On Cloud Platform Web Crawler
4	Research On Web Page Classification And Information Collection
5	Design And Implementation Of A Distributed Web Crawler System Based On Hadoop
6	Research On Optimization Of Hadoop Distributed Web Crawler System
7	Investigation On Web Crawler Technology Based On Hadoop Platform
8	Research On A Method Of Focused Crawler Based On Page Partition
9	Design And Implementation Of Network Analysis Based On The Page Crawler
10	Research On Focused Crawler Based On Page Segmentation