Self-Correction Of Word Alignments System

Posted on: 2018-12-29
Degree: Master
Type: Thesis
Country: China
Candidate: H M Gong
Full Text: PDF
GTID: 2348330542465252
Subject: Software engineering
Abstract/Summary:
The core idea of statistical machine translation is to analyze a large bilingual parallel corpus and then build a statistical translation model that is used to translate the test text. Bilingual word alignment is an essential component of a statistical machine translation system: it is a prerequisite for generating the phrase table and for extracting translation rules, so the accuracy of word alignment has a significant effect on system performance. However, word alignments are derived from statistics over bilingual token sequences, without reference to bilingual hierarchical structure or linguistic features, which leads to alignment errors, data sparsity, and other problems. Because the aligner searches all possible word alignments for each sentence pair, alignments that do not conform to linguistic features are always included in the search space and may be output when they have a higher statistical probability.

This thesis uses bilingual hierarchical structure and linguistic features to improve the quality of word alignments and the performance of the machine translation system. We propose a self-correcting mechanism for word alignment that adds a feedback loop on top of traditional word alignment. The mechanism re-plans the alignment search space based on the output of the previous round, so that incorrect word alignments can be avoided. Within the feedback loop, sentence pairs are divided according to different hierarchical structures, moving gradually from the sentence level to the clause and phrase levels. The main contributions are as follows:

(1) Judging non-parallel relationships between sentence pairs. Since the alignment quality of the Chinese and English corpora used for training is unknown, it is necessary to judge whether each sentence pair is parallel, using the traditional word alignment information, in order to ensure the validity of the binary segmentation method.

(2) Locating the best segmentation point, which is the core component of the word alignment self-correction algorithm. A good segmentation point can effectively split complex sentence pairs, correct the original word alignment errors, and improve the quality of machine translation. Three methods are proposed. The first is binary segmentation based on punctuation, which selects the best segmentation point among all possible punctuation combinations. The second is binary segmentation based on related words, which uses words characteristic of the related sentence components as the basis for segmentation and divides the sentence pairs into finer parts. The third is binary segmentation based on statistical features, which adds syntactic structure features to the two methods above in order to find the best segmentation point. Gibbs sampling is used to select the best segmentation position according to the probability distribution of the statistical features. This improves the accuracy of segmentation and reduces the word alignment error rate.

(3) Identifying and correcting non-parallel relations. After obtaining the best segmentation points, we compute the density and the error rate of the traditional word alignments and determine whether the sentence pair should be segmented and corrected further. The resulting sub-pairs are then aligned with GIZA++ to obtain new word alignments, which are merged according to the segmentation position. As a result, the quality of machine translation is improved.
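The punctuation-based segmentation and the final merge step can be illustrated with a minimal Python sketch. The abstract does not say how candidate segmentation points are scored, so the sketch assumes one plausible criterion: prefer the pair of cut positions that the fewest existing alignment links cross. It also uses a plain exhaustive search over candidates rather than the Gibbs sampling described in the thesis. The function names (`punctuation_cuts`, `crossing_links`, `best_split`, `merge_alignments`), the punctuation set, and the toy sentence pair are all hypothetical.

```python
from typing import List, Optional, Tuple

Link = Tuple[int, int]  # (source token index, target token index), 0-based

PUNCTUATION = {",", ";", ":", "，", "；", "："}  # assumed candidate markers


def punctuation_cuts(tokens: List[str]) -> List[int]:
    """Return cut positions that fall right after a punctuation token.

    A cut position c splits tokens into tokens[:c] and tokens[c:];
    cuts that would leave an empty half are skipped.
    """
    return [i + 1 for i, tok in enumerate(tokens)
            if tok in PUNCTUATION and 0 < i + 1 < len(tokens)]


def crossing_links(links: List[Link], src_cut: int, tgt_cut: int) -> int:
    """Count links joining the left half of one side to the right half of the other."""
    return sum(1 for s, t in links if (s < src_cut) != (t < tgt_cut))


def best_split(links: List[Link], src_cuts: List[int],
               tgt_cuts: List[int]) -> Optional[Tuple[int, int, int]]:
    """Exhaustively pick the (src_cut, tgt_cut) pair with the fewest crossing links.

    Returns (src_cut, tgt_cut, crossings), or None if there are no candidates.
    """
    best = None
    for sc in src_cuts:
        for tc in tgt_cuts:
            c = crossing_links(links, sc, tc)
            if best is None or c < best[2]:
                best = (sc, tc, c)
    return best


def merge_alignments(left: List[Link], right: List[Link],
                     src_cut: int, tgt_cut: int) -> List[Link]:
    """Combine sub-pair alignments, shifting right-half indices by the cut offsets."""
    shifted = {(s + src_cut, t + tgt_cut) for s, t in right}
    return sorted(set(left) | shifted)


# Toy example: the link (6, 1) is a deliberate alignment error.
src = ["wo", "lai", "le", ",", "ta", "zou", "le"]
tgt = ["I", "came", ",", "he", "left"]
links = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 4), (6, 4), (6, 1)]

split = best_split(links, punctuation_cuts(src), punctuation_cuts(tgt))

# Alignments that a re-run of the aligner on each sub-pair might return
# (hand-written here; no aligner is actually invoked).
left_links = [(0, 0), (1, 1), (2, 1), (3, 2)]
right_links = [(0, 0), (1, 1), (2, 1)]
merged = merge_alignments(left_links, right_links, split[0], split[1])
```

In this toy case the only candidate split is after the comma on both sides, and the erroneous link is the single link crossing it. Re-aligning each half separately removes that link from the search space, which is the intuition behind restricting the alignment search by segmentation.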
Keywords/Search Tags: Self-Correction, Word alignment, Binary segmentation, Gibbs sampling