Research On Parallel Resources Mining From The Internet

Posted on:2011-03-20

Degree:Master

Type:Thesis

Country:China

Candidate:Z X Yan

Full Text:PDF

GTID:2178360305476162

Subject:Computer application technology

Abstract/Summary:

Large-scale bilingual corpora benefit many natural language processing (NLP) applications, such as machine translation and cross-language information retrieval. Although previous studies have invested considerable manpower, material, and financial resources in building bilingual corpora, the existing corpora fall far short of what is needed to process real text because of their small scale, poor timeliness, and imbalance across domains. This thesis focuses on building a platform that acquires large-scale bilingual parallel corpora automatically.

The Internet contains a considerable amount of duplicated text, which must be dealt with before bilingual resources can be extracted. By retrieving similar articles with keywords and then identifying the paragraphs repeated across those articles, we strike an effective balance between precision and recall. Keywords are selected by an iterative method that alternates between sentences and words. Eliminating duplicated web pages lays a solid foundation for improving the efficiency of bilingual extraction.

Many parallel sentences are embedded in bilingual mixed web pages. We extract parallel sentence pairs by analyzing both the structure and the content of a given page. First, we acquire bilingual mixed web pages using ordinary text search engines; then, for each bilingual page, we collect candidate parallel sentences by segmenting the whole page into different data regions. To align these candidates, we use word overlap, a length-based measure, and M-N HTML tag node alignment. Finally, the sentence pairs in the candidate bilingual web pages are verified by a maximum entropy classifier that combines length, word-overlap, text location, and alignment features.

Website authors usually follow some rules in naming parallel bilingual web pages. Based on this observation, we present an algorithm to locate parallel pages within bilingual websites. The identification of parallel texts is treated as a classification problem whose features are closely related to the length of the page content, word overlap in anchor text, and word alignment. Bilingual websites are located using pre-defined string patterns and edit distance.
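
As an illustration of locating parallel pages by naming rules, the following is a minimal sketch, not the thesis's actual algorithm. It assumes English and Chinese pages differ only by a language marker in the URL. The marker lists, the function names (`candidate_urls`, `pair_pages`), and the `max_distance` threshold are all hypothetical choices made for this example. A URL is first rewritten with pre-defined string patterns; if no rewritten form matches exactly, the closest Chinese URL by edit distance is accepted when it is near enough.

```python
import re
from itertools import product

# Hypothetical language markers; the thesis does not list its actual patterns.
EN_MARKERS = ("en", "eng", "english")
ZH_MARKERS = ("zh", "cn", "chs", "chinese")


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_urls(url):
    """Rewrite a URL by swapping each English marker for each Chinese marker."""
    out = set()
    for en, zh in product(EN_MARKERS, ZH_MARKERS):
        # Match the marker only when it is not part of a longer word or number.
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(en) + r"(?![A-Za-z0-9])"
        )
        swapped = pattern.sub(zh, url)
        if swapped != url:
            out.add(swapped)
    return out


def pair_pages(en_urls, zh_urls, max_distance=2):
    """Pair each English URL with its most plausible Chinese counterpart."""
    zh_set = set(zh_urls)
    pairs = []
    for en_url in en_urls:
        candidates = candidate_urls(en_url)
        exact = candidates & zh_set
        if exact:
            pairs.append((en_url, min(exact)))
            continue
        # Fallback: nearest Chinese URL to any rewritten candidate.
        scored = [
            (levenshtein(cand, zh_url), zh_url)
            for cand in candidates
            for zh_url in zh_urls
        ]
        if scored:
            distance, best = min(scored)
            if distance <= max_distance:
                pairs.append((en_url, best))
    return pairs


if __name__ == "__main__":
    en = ["http://example.com/en/news/1.html",
          "http://example.com/about_en.html"]
    zh = ["http://example.com/zh/news/1.html",
          "http://example.com/about_cn.htm"]
    for pair in pair_pages(en, zh):
        print(pair)
```

In this sketch, URL pairing only proposes candidates. Per the abstract, the thesis treats the decision of whether two pages are truly parallel as a separate classification step based on content length, anchor-text word overlap, and word alignment.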

Keywords/Search Tags:

Parallel Resource, Web Mining, Parallel Sentence, Web Page De-duplication

Related items

1	Extracting Parallel Sentence From Large Scale Web Data
2	Web-Based English-Chinese Bilingual Parallel Sentences
3	A Study On The Key Technologies Of Web-Based Indonesian-Chinese Parallel Corpus Construction
4	Parallel Corpus Mining System Based On Cross-lingual Sentence Embedding
5	Research On Large-Scale Bilingual Parallel Corpus Extraction From The Web
6	Web-oriented Multilingual Parallel Sentence Pairs Mining Techniques
7	The Acquisition Of Parallel Sentence For Statistical Machine Translation Based On Internet
8	Comparable Corpus Acquisition Of Cambodian-Chinese Parallel Sentence Pairs Based On Bidirectional Recurrent Neural Network
9	Parallel Data Mining Theory Research And Application
10	Research On The Automatic Construction Of Chinese-Japanese Parallel Corpus