
Research And Implementation Of Bilingual Corpus Mining On The Internet

Posted on: 2012-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
GTID: 2178330332967386
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet and growing global integration, more and more bilingual websites have been set up. As Computer Aided Translation has become a focus of research, bilingual corpora have gradually become an emerging research area in natural language processing. A bilingual corpus not only provides essential training data for Computer Aided Translation and Machine Translation models, but is also an important basic resource for applications such as bilingual instruction, lexicography and cross-language information retrieval.

Large bilingual corpora are hard to obtain. Here, automatic mining of a bilingual corpus is achieved using the URL similarity and structural similarity of webpages; feedback from sentence alignment is then added to improve the quality of the corpus. The acquisition of bilingual webpages with wget under Linux is described, and sentence alignment is then performed on these webpages.

Sentence alignment of webpages has long been an active international research field, and many algorithms have already been proposed. Building on current research results in sentence alignment, this paper analyses the characteristics of English and Chinese sentences and of bilingual webpages. An alignment method based on both sentence length and HTML tags is used. To optimize the alignment, a strategy of manually tuning the alignment results is adopted. Finally, alignment results on Chinese-English bilingual webpages are obtained under several different experimental conditions. Compared with some existing methods, the data indicate that the method using HTML tags achieves higher precision and recall.
Keywords/Search Tags:Computer Aided Translation, Bilingual Corpus, Bilingual Webpages, Webpage Marks, Sentence Alignment
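
The abstract mentions pairing webpages by URL similarity but gives no details. The following is a minimal sketch of one common way such pairing can be done, not the thesis's actual procedure: it assumes that parallel pages differ only by a language marker in the URL path. The marker sets in `LANG_MARKERS` and the function names `url_key` and `candidate_pairs` are hypothetical, chosen for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed language markers; real sites use many other conventions.
LANG_MARKERS = {
    "en": {"en", "eng", "english"},
    "zh": {"zh", "cn", "chs", "chinese"},
}

def url_key(url):
    """Return (language, normalized URL) with the first language marker
    in the path replaced by '*'. Language is None if no marker is found."""
    parts = urlsplit(url.lower())
    # Split the path on common delimiters, keeping the delimiters.
    tokens = re.split(r"([/._-])", parts.path)
    lang = None
    for i, tok in enumerate(tokens):
        for code, markers in LANG_MARKERS.items():
            if lang is None and tok in markers:
                lang = code
                tokens[i] = "*"
    return lang, parts.netloc + "".join(tokens)

def candidate_pairs(urls):
    """Group URLs by normalized key and return (English, Chinese) pairs."""
    buckets = defaultdict(dict)
    for url in urls:
        lang, key = url_key(url)
        if lang is not None:
            buckets[key][lang] = url
    return [(b["en"], b["zh"]) for b in buckets.values()
            if "en" in b and "zh" in b]

if __name__ == "__main__":
    pages = [
        "http://example.com/en/about.html",
        "http://example.com/cn/about.html",
        "http://example.com/contact.html",
    ]
    for en_url, zh_url in candidate_pairs(pages):
        print(en_url, "<->", zh_url)
```

Pairs found this way are only candidates; as the abstract notes, structural similarity and sentence-alignment feedback are then used to filter them.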