
Research On Bilingual Corpus Quality Assessment Techniques For Statistical Machine Translation

Posted on: 2015-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Q Du
GTID: 2348330482960376
Subject: Computer technology
Abstract/Summary:
Statistical machine translation learns statistical models from a large-scale parallel corpus and uses those models to translate. A high-quality statistical machine translation system therefore depends on a large, high-quality bilingual parallel corpus. Because training requires very large corpora, however, most of the data contain a large number of errors or noise, which greatly degrades the performance of the system. Screening such a corpus manually for high-quality data is time-consuming and laborious. Automatically evaluating the quality of parallel bilingual corpora, so that high-quality bilingual sentence pairs can be obtained, is therefore an important research topic.
This thesis first improves the length-ratio-based data quality evaluation method. The traditional method selects data directly, which gives no control over the amount of data selected. The thesis proposes an improved method based on sorting, and experiments confirm its validity.
Second, this thesis improves the dictionary-translation-based data quality evaluation method. The traditional method considers only the one-way dictionary translation ratio. Here the method is improved by considering the two-way translation probability and is further optimized through generalization, stemming, and stop-word processing. Experiments confirm the effectiveness of this method.
Finally, this thesis proposes a data quality evaluation method for statistical machine translation that is based on the idea of translation itself: forced decoding is applied to data quality detection. The method requires no additional data resources; all the data it needs come from the data being evaluated.
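The sorting-based variant of the length-ratio method can be illustrated with a minimal sketch. It is not the thesis's implementation: the scoring function, the choice of the corpus-wide token ratio as the expected ratio, and the names `length_ratio_score` and `select_by_length_ratio` are all assumptions made for illustration. The point it shows is that ranking the pairs and keeping the top `keep` entries lets the amount of selected data be set directly, which a fixed threshold does not.

```python
def length_ratio_score(src_tokens, tgt_tokens, expected_ratio):
    """Score in [0, 1]; 1.0 means the pair's length ratio equals expected_ratio."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    ratio = len(tgt_tokens) / len(src_tokens)
    # Symmetric closeness: the smaller ratio divided by the larger one.
    return min(ratio, expected_ratio) / max(ratio, expected_ratio)


def select_by_length_ratio(pairs, keep):
    """Rank (src_tokens, tgt_tokens) pairs by length-ratio score and keep the best `keep`.

    Assumption: the expected ratio is estimated from the corpus itself as
    total target tokens / total source tokens.
    """
    total_src = sum(len(src) for src, _ in pairs)
    total_tgt = sum(len(tgt) for _, tgt in pairs)
    if total_src == 0 or total_tgt == 0:
        return []
    expected = total_tgt / total_src
    ranked = sorted(
        pairs,
        key=lambda pair: length_ratio_score(pair[0], pair[1], expected),
        reverse=True,
    )
    return ranked[:keep]
```

Because the output size is an explicit argument, the same ranking can be reused to produce training sets of several sizes for comparison.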
With a reduced amount of data, data selected by forced decoding can match or even exceed the performance obtained with all the data. This thesis also compares and analyzes the filtered and unfiltered data through manual evaluation; the experiments show that filtering raises data quality from 56% to 87.7%. The method still needs further study, and further improvements are left to future work.
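The two-way dictionary idea from the abstract can also be sketched in a few lines. This is a simplified illustration under stated assumptions, not the thesis's method: it uses a dictionary coverage ratio in each direction rather than translation probabilities, it combines the two directions with an arithmetic mean, and the names `coverage` and `two_way_score`, the lexicon format (lowercase word mapped to a set of translations), and the stop-word sets are hypothetical. Stemming and generalization are omitted.

```python
def coverage(from_tokens, to_tokens, lexicon, stop_words=frozenset()):
    """Fraction of non-stop words in from_tokens that have a dictionary
    translation present in to_tokens.

    Assumption: `lexicon` maps a lowercase word to a set of lowercase translations.
    """
    content = [w.lower() for w in from_tokens if w.lower() not in stop_words]
    if not content:
        return 0.0
    to_set = {w.lower() for w in to_tokens}
    hits = sum(1 for w in content if lexicon.get(w, set()) & to_set)
    return hits / len(content)


def two_way_score(src, tgt, src2tgt, tgt2src,
                  src_stop=frozenset(), tgt_stop=frozenset()):
    """Average of source-to-target and target-to-source coverage.

    Assumption: the arithmetic mean is one simple way to combine both directions.
    """
    forward = coverage(src, tgt, src2tgt, src_stop)
    backward = coverage(tgt, src, tgt2src, tgt_stop)
    return (forward + backward) / 2
```

A pair whose words translate each other in both directions scores near 1, while a pair that matches in only one direction is penalized, which is the motivation for going beyond a one-way ratio.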
Keywords/Search Tags:statistical machine translation, quality assessment, bilingual corpora, forced decoding, data filtering