The Establishment and Development of a Web-Based Chinese-English Parallel Corpus

Posted on: 2015-11-13  Degree: Master  Type: Thesis
Country: China  Candidate: F Luo  Full Text: PDF
GTID: 2308330473457085  Subject: Software engineering
Abstract/Summary:
In recent years, researchers working on parallel corpora have increasingly turned to corpus linguistics. Researchers in natural language processing have recognized the research value of high-quality, large-scale Chinese-English parallel corpora, and such corpora also play an important role in comparative linguistics and related fields. However, Chinese-English parallel corpora fall far short of monolingual corpora in both scale and quality. With the development of the Internet, exchange between languages has become increasingly frequent, and bilingual corpora have become an indispensable resource for machine translation, machine-aided translation, and translation knowledge acquisition. Although the role of bilingual corpora in machine translation research is increasingly evident, the systematic construction of bilingual parallel corpora has not yet received sufficient attention in China and remains largely at the theoretical stage.

This thesis describes a system for building a large-scale Chinese-English parallel corpus. The system exploits the enormous multilingual resources of the Internet: by analyzing web content and links, it mines bilingual text automatically. It adopts a B/S (browser/server) architecture with two subsystems, a crawler system and an index system, which are loosely coupled and do not affect each other at run time. The web crawler automatically fetches eligible web pages from the Internet and, after processing, stores them in the database. Lucene then indexes the data in the database, and querying the index with predefined rules yields the parallel corpus.
The system currently handles Chinese and English, but the language pair is configurable: with slight modification, the system can be set up to build a parallel corpus for any two languages. The system was developed on the MyEclipse platform; the front-end pages use JSP dynamic web technology, the back-end database is the open-source MySQL, and the design follows the MVC pattern. The aim is to study related work in China and abroad and, on that basis, to investigate the methods and process of developing an automatic Chinese-English parallel corpus construction system with current dynamic web development technology. The system can collect a large amount of Chinese-English parallel text for future use and provide language resources for online Chinese-English translation. This work targets the automatic construction of Chinese-English parallel corpora and lays the groundwork for a future system that automatically builds Chinese-Uighur parallel corpora.
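
As a rough illustration of how a predefined rule with a configurable language pair might select candidate parallel pages from crawled data, the following Java sketch pairs pages whose URLs differ only by a language marker. This is not the thesis's implementation: the abstract does not state its rules, so the URL-marker rule, the class name ParallelPagePairer, the method names, and the example URLs are all hypothetical. An in-memory map stands in for the MySQL database and the Lucene index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pair crawled pages whose URLs differ only by a language marker. */
class ParallelPagePairer {

    /**
     * Replaces the first occurrence of the source-language marker in a URL
     * with the target-language marker. Returns null if the marker is absent.
     */
    static String counterpartUrl(String url, String sourceMarker, String targetMarker) {
        int at = url.indexOf(sourceMarker);
        if (at < 0) {
            return null;
        }
        return url.substring(0, at) + targetMarker + url.substring(at + sourceMarker.length());
    }

    /**
     * Returns {sourceText, targetText} pairs for every source-language page
     * whose counterpart URL was also crawled. The markers are parameters, so
     * the same rule can be reused for another language pair.
     */
    static List<String[]> pair(Map<String, String> pages, String sourceMarker, String targetMarker) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String other = counterpartUrl(page.getKey(), sourceMarker, targetMarker);
            if (other != null && pages.containsKey(other)) {
                pairs.add(new String[] { page.getValue(), pages.get(other) });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumed sample data: URL -> extracted page text.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://example.com/zh/news/1.html", "Chinese text of article 1");
        pages.put("http://example.com/en/news/1.html", "English text of article 1");
        pages.put("http://example.com/zh/news/2.html", "Chinese text of article 2");

        for (String[] p : pair(pages, "/zh/", "/en/")) {
            System.out.println(p[0] + " <-> " + p[1]);
        }
    }
}
```

A URL rule alone only proposes candidates. A real system would presumably also compare page content, for example structure or length, before accepting a pair.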
Keywords/Search Tags: Crawler, Parallel corpus, Lucene, Chinese-English