
Web-oriented Multilingual Parallel Sentence Pairs Mining Techniques

Posted on: 2015-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
Full Text: PDF
GTID: 2298330422990907
Subject: Computer Science and Technology
Abstract/Summary:
Bilingual parallel corpora are an important resource for statistical machine translation and for a range of related research and applications. Traditionally, parallel sentences were checked and entered by hand, which is time-consuming and laborious and makes it difficult to build a large-scale parallel corpus in limited time. With the development of the Internet, many bilingual and multilingual websites have been published, and many researchers have begun to study how to obtain bilingual parallel corpora from the Web. However, past research mainly focused on mining parallel corpora from parallel webpages (two webpages containing the same content written in two different languages). Because parallel webpages are scarce, the scale and domain coverage of the resulting corpora are limited. Later, some researchers found that the Web contains a large number of mixed webpages (a single webpage containing the same content written in two different languages), and that their quality and domain coverage are better than those of parallel webpages. This study therefore aims to build an automatic system for obtaining bilingual parallel corpora from mixed webpages.

Our study involves the following aspects:

(1) We summarize the current state of research, in China and abroad, on bilingual parallel corpora and on methods for constructing them. At present, research on bilingual parallel corpora focuses on data processing, such as corpus annotation and translation knowledge acquisition. At the same time, parallel corpus construction has concentrated on English-Chinese parallel corpora, and the construction of large-scale raw bilingual corpora has not been given enough attention.

(2) Based on parallel corpus construction methods, we implement a system that mines parallel corpora from the Web automatically. The system uses widely available, high-value mixed webpages as the source of bilingual parallel sentences. Its technical difficulties are obtaining candidate webpages, bilingual webpage detection, webpage text analysis, and parallel sentence alignment. The system finds and downloads candidate mixed webpages by using a search engine, detects mixed webpages by the ratio of the lengths of the texts in the different languages, extracts text by analyzing HTML tags, and aligns parallel sentences by incorporating HTML tag features. The accuracy of bilingual webpage detection is above 95%, webpage text parsing accuracy reaches 88%, and parallel sentence alignment accuracy reaches 90%.

(3) Using the bilingual corpus obtained in this work, we built a multilingual webpage retrieval system. The system uses English as the intermediate language for processing the user query. In a simple test, the search results returned by the system basically met the requirements.
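The length-ratio test used for mixed webpage detection could look roughly like the following. This is only a minimal sketch, not the thesis implementation: it assumes a Chinese-English pair, counts CJK ideographs against ASCII letters as a stand-in for "text length", and the function names and thresholds (`low`, `high`, `min_chars`) are hypothetical values chosen for illustration.

```python
def script_counts(text):
    """Count CJK ideographs and ASCII letters in a page's extracted text."""
    # Assumption: the CJK Unified Ideographs block stands in for Chinese text.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    # Assumption: ASCII letters stand in for English text.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return cjk, latin


def is_mixed_page(text, low=0.2, high=5.0, min_chars=20):
    """Guess whether a page holds comparable amounts of text in both languages.

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    cjk, latin = script_counts(text)
    # Require a minimum amount of text in each language.
    if cjk < min_chars or latin < min_chars:
        return False
    # A mixed page should have a letter-to-ideograph ratio within a bounded range.
    ratio = latin / cjk
    return low <= ratio <= high
```

A page that passes this test would then go on to text extraction and sentence alignment, while monolingual pages and pages with only a few foreign words are discarded early.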
Keywords/Search Tags: Bilingual parallel corpus, mixed webpage, obtain corpus, multi-integrated retrieval