Chinese-English Bilingual Corpora Acquisition From The Web

Posted on:2013-11-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y Lin

Full Text:PDF

GTID:2248330371967440

Subject:Computer technology

Abstract/Summary:

With the development of Internet technology, the amount of information on the web is growing explosively, so extracting specific, useful information automatically or semi-automatically has become a problem. Chinese-English bilingual corpora are a very important resource in natural language processing research: they are useful in machine learning, machine translation, bilingual information retrieval, and so on. Massive bilingual corpora are of great importance for improving the accuracy of statistical machine translation. The web now holds many bilingual corpora of varying form and quality, so extracting massive, high-quality corpora from it is becoming an increasingly important task.

This paper presents a method for acquiring massive Chinese-English bilingual corpora from the web. The method builds on main-body extraction for web pages while taking into account the particular characteristics of pages that contain Chinese-English bilingual text. First, the HTML source is processed into text lines. Then the page title and the text density are used to determine the approximate region containing the main body. On this basis, the pages are cleaned and the extracted content is filtered: pages whose ratio of Chinese to English word counts falls outside a given range are discarded, and those that satisfy the rules are saved, yielding Chinese-English bilingual corpora. Experiments show that the method is useful for obtaining massive, high-quality Chinese-English bilingual corpora.
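The steps in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no thresholds or exact density measure, so the function names (`text_lines`, `main_body`, `zh_en_ratio`, `is_bilingual`), the 20-character density cutoff, and the 0.5 to 3.0 ratio range are all assumptions chosen for illustration. The page-title cue is omitted, Chinese "words" are approximated by counting CJK characters, and density is approximated by line length.

```python
import re
from html.parser import HTMLParser

# Assumption: one CJK character is counted as one Chinese "word".
CJK = re.compile(r"[\u4e00-\u9fff]")
EN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


class _TextLines(HTMLParser):
    """Collect visible text nodes, skipping script, style and title."""

    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.lines.append(text)


def text_lines(html):
    """Turn HTML source into a list of non-empty text lines."""
    parser = _TextLines()
    parser.feed(html)
    parser.close()
    return parser.lines


def main_body(lines, min_chars=20):
    """Return the longest run of consecutive dense lines.

    A line is "dense" if it has at least min_chars characters
    (a hypothetical stand-in for the thesis's density measure).
    """
    best, current = [], []
    for line in lines:
        if len(line) >= min_chars:
            current.append(line)
        else:
            if len(current) > len(best):
                best = current
            current = []
    if len(current) > len(best):
        best = current
    return best


def zh_en_ratio(text):
    """Ratio of Chinese character count to English word count."""
    zh = len(CJK.findall(text))
    en = len(EN_WORD.findall(text))
    return zh / en if en else float("inf")


def is_bilingual(text, low=0.5, high=3.0):
    """Keep a page only if its ratio lies in [low, high] (assumed range)."""
    return low <= zh_en_ratio(text) <= high
```

A page would be kept when `is_bilingual("\n".join(main_body(text_lines(html))))` is true. In practice the ratio bounds would need tuning on sample pages, since the appropriate range depends on how Chinese words are counted.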

Keywords/Search Tags:

information explosion, web page cleaning, web content extraction, Chinese-English bilingual corpora

Related items

1	Web-Based English-Chinese Bilingual Parallel Sentences
2	Research On Construction And Application Of English-Chinese Comparable Corpora
3	Research On Key Technology In Mining Web Bilingual Corpora
4	Chinese-English Bilingual Corpus Automatically Aligned
5	Research On Bilingual Lexicon Construction Between Chinese And English From Comparable Corpora
6	The Study Of The Alignment Method In The Chinese-English Parallel Corpora
7	Research On Web-based Chinese-English Bilingual Dictionary Generation
8	A Study On The Quality Of Chinese And English Bilingual Broadcasting
9	Today's Chinese-English Bilingual Radio Host Teaching Mode
10	Creating Chinese-English Comparable Corpora