Chinese-English Bilingual Corpora Acquisition From The WEB

Posted on: 2013-11-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y Lin
Full Text: PDF
GTID: 2248330371967440
Subject: Computer technology
Abstract/Summary:
With the development of Internet technology, the amount of information on the web is growing explosively, so extracting specific, useful information automatically or semi-automatically has become a problem. Chinese-English bilingual corpora are a very important resource in natural language processing research: they are useful in machine learning, machine translation, bilingual information retrieval, and so on. Large bilingual corpora are of great importance for improving the accuracy of statistical machine translation. The web now holds many bilingual corpora of different forms and qualities, so extracting large, high-quality corpora from it is becoming an increasingly important task.

This thesis presents a method for acquiring large Chinese-English bilingual corpora from the web. The method builds on main-body extraction for web pages while taking into account the particular characteristics of pages that contain Chinese-English bilingual text. First, the HTML source is processed into text lines. Then the page title and the text density are used to determine the rough area that contains the main body. On this basis, the pages are cleaned and the extracted content is filtered: pages whose ratio of Chinese to English word counts falls outside a given range are deleted, and those that satisfy the rules are saved as Chinese-English bilingual corpora. Experiments show that the method is useful for obtaining large, high-quality Chinese-English bilingual corpora.
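The ratio-based filtering step might look roughly like the sketch below. It is only an illustration, not the thesis's implementation: the function names, the regular expressions, and the `low`/`high` bounds are all assumptions, since the abstract does not state the range used. It also counts Chinese characters as a stand-in for Chinese words, because real word counts would need a segmenter.

```python
import re

# Assumption: one CJK Unified Ideograph approximates one Chinese "word";
# the thesis may well use a proper word segmenter instead.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: an English word is a run of ASCII letters, optionally with an apostrophe part.
EN_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def zh_en_ratio(text):
    """Return (Chinese character count) / (English word count), or None if no English words."""
    zh_count = len(CJK_CHAR.findall(text))
    en_count = len(EN_WORD.findall(text))
    if en_count == 0:
        return None
    return zh_count / en_count


def is_bilingual(text, low=0.5, high=3.0):
    """Keep a page only if its ratio lies in [low, high].

    The default bounds are hypothetical placeholders, not values from the thesis.
    """
    ratio = zh_en_ratio(text)
    return ratio is not None and low <= ratio <= high


if __name__ == "__main__":
    pages = [
        "我喜欢学习英语。 I like learning English.",
        "This page contains only English text.",
        "全部都是中文",
    ]
    kept = [page for page in pages if is_bilingual(page)]
    print(len(kept), "of", len(pages), "pages kept")
```

In practice the bounds would presumably be tuned on a sample of known parallel pages, and the filter would run on the extracted main body rather than on the raw page.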
Keywords/Search Tags:information explosion, web page cleaning, web content extract, Chinese-English bilingual corpora