Research on Corpus Methods for Statistical Machine Translation

Posted on: 2011-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Li
Full Text: PDF
GTID: 2208360305473790
Subject: Computer applications
Abstract/Summary:
In statistical machine translation (SMT), the scale and quality of the bilingual corpus and the accuracy of the word-alignment algorithm strongly influence the performance of the translation system. A larger corpus can improve word-alignment accuracy and system performance, but at the cost of a heavier load on the system. Research on SMT should therefore not rely solely on blindly enlarging the existing corpus to strengthen an SMT system.

To address this problem, this thesis proposes improving translation quality by making better use of the existing parallel training corpus. We preprocess the corpus with machine learning, separating it into literal translation pairs and free translation pairs. The training data are classified by a method that combines grammatical compatibility with lexical compatibility, the cross-lingual word kernel is improved with linguistic knowledge, and the SMT model is improved by training on the classified corpus. Experiments show that literal translations are better suited to the capabilities of an SMT system. When a large corpus yields a large vocabulary, weighting the training data can increase its contribution to translation. Detailed analysis shows that, if a heuristic pruning algorithm is used to avoid increasing the rate of out-of-vocabulary (OOV) words, the literal translation corpus contributes more to the SMT model.
Keywords/Search Tags: Statistical Machine Translation (SMT), Bilingual Corpus, SVM, Corpus Selection
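
The corpus-selection idea in the abstract (separating literal from free translation pairs, then weighting the training data) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract names an SVM over grammatical and lexical compatibility features, whereas this sketch uses only a simplified lexical score with a fixed threshold. The toy lexicon, the function names, the threshold of 0.6, and the weights are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[List[str], List[str]]

# Hypothetical toy lexicon (assumption); a real system would derive
# translation candidates from word alignments or a dictionary.
LEXICON: Dict[str, Set[str]] = {
    "我": {"i"},
    "喜欢": {"like", "love"},
    "猫": {"cat", "cats"},
    "下雨": {"rain", "raining"},
}


def lexical_compatibility(src: List[str], tgt: List[str],
                          lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with a lexicon translation in the target."""
    if not src:
        return 0.0
    tgt_set = {t.lower() for t in tgt}
    covered = sum(1 for s in src if lexicon.get(s, set()) & tgt_set)
    return covered / len(src)


def split_corpus(pairs: List[Pair], lexicon: Dict[str, Set[str]],
                 threshold: float = 0.6) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (literal, free) by a fixed score threshold.

    The threshold stands in for the SVM classifier named in the abstract.
    """
    literal: List[Pair] = []
    free: List[Pair] = []
    for src, tgt in pairs:
        score = lexical_compatibility(src, tgt, lexicon)
        (literal if score >= threshold else free).append((src, tgt))
    return literal, free


def weight_pairs(literal: List[Pair], free: List[Pair],
                 literal_weight: float = 2.0,
                 free_weight: float = 1.0) -> List[Tuple[Pair, float]]:
    """Attach a training weight to each pair (weights are assumptions)."""
    return ([(p, literal_weight) for p in literal]
            + [(p, free_weight) for p in free])


pairs: List[Pair] = [
    (["我", "喜欢", "猫"], ["I", "like", "cats"]),
    (["下雨", "了"], ["It", "is", "pouring", "cats", "and", "dogs"]),
]
literal, free = split_corpus(pairs, LEXICON)
weighted = weight_pairs(literal, free)
```

Here the first pair is fully covered by the lexicon and lands in the literal set, while the idiomatic second pair lands in the free set. A fuller version would add a grammatical compatibility feature and train a classifier on labeled pairs instead of fixing a threshold.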