The Research And Implementation Of The Key Technologies On The Acquisition System Of The Bilingual Corpus

Posted on: 2013-02-14  Degree: Master  Type: Thesis
Country: China  Candidate: H J Ai  Full Text: PDF
GTID: 2268330401967066  Subject: Software engineering
Abstract/Summary:
With the development of statistical natural language processing, the role of parallel corpora in statistical machine translation and cross-language retrieval cannot be ignored. Existing parallel corpora are still unable to meet the requirements of practical applications, so the bilingual corpus has become the bottleneck in the development of statistical machine translation systems and cross-language information retrieval. Researchers in China and abroad are now paying close attention to further research on bilingual corpora. Existing English-Chinese corpora are concentrated mainly in well-known literary translations, government documents, press and legal texts, and other specialized areas. This imbalance lowers the level of research in practical applications and has brought corpus-based research to a bottleneck. To reduce the manual effort involved in collecting bilingual corpora, it is necessary to study an efficient corpus-building program that can be applied easily across fields and can replace the previous manual way of obtaining a bilingual corpus. Providing accurate solutions to these practical problems is of great practical significance for related research and development.

This thesis proposes several methods for acquiring bilingual corpora from different websites, including automatic acquisition from the "iciba" website, CNKI, and a patent network, and it describes the methods and procedures used to acquire each kind of corpus. We propose different acquisition methods suited to the characteristics of each site, and achieve fast and accurate automatic acquisition of a large-scale bilingual corpus. For the "iciba" website, the main tool is the Nutch crawler, which performs relatively well, retrieving accurately and returning well-correlated results. In addition, rather than trying to obtain a bilingual corpus from the entire Internet, we take an entirely new approach: accessing the basic information of scholarly theses in CNKI to obtain a large-scale, high-quality English-Chinese bilingual corpus.

In the end we obtain a large-scale aligned bilingual corpus at the gigabyte level, which manual evaluation shows to be highly accurate. The corpus lays the groundwork for further research on cross-language information retrieval.
Keywords/Search Tags: bilingual corpus, machine translation, cross-language retrieval, getcorpus