Research On Some Key Aspects Of Statistical Machine Translation

Posted on: 2008-01-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Xue
GTID: 1118360245496624
Subject: Computer application technology
Abstract/Summary:
Machine translation is the use of a computer to translate one natural language into another, which can be viewed as a decision problem. The major research directions in machine translation include rule-based, interlingua-based, example-based and statistical methods. Statistical machine translation has shown clear benefits and currently receives much attention. Statistical translation models include word-based, phrase-based and syntax-based models. In this thesis, some key techniques of phrase-based and syntax-based models are studied. As a first step, three classical machine translation methods are systematically compared, and their advantages and disadvantages are discussed in detail. On this basis, the problem of efficiently extracting bilingual phrase translation pairs is studied. For the syntax-based method, the focus is on the decoding problem, which leads to direct decoding algorithms. In addition, a syntax-based reordering model is presented for phrase-based statistical machine translation. Finally, a brief translation approach based on information extraction is proposed, in which the strengths of the statistical and rule-based approaches are combined.

This thesis is arranged as follows:

1. The classical approaches to statistical machine translation are analyzed, and a new strategy that differs from them is tried. Based on the experimental results, the strengths and weaknesses of these approaches are pointed out. In particular, conventional syntax-based statistical machine translation is analyzed further. A framework for refinement is then proposed as preparation for further study. It presents strategies for incorporating syntax into the phrase-based method and for combining the statistical approach with the rule-based approach.

2. Methods for extracting phrase translation pairs from n-best alignments are studied. A loose phrase extraction method is proposed, and extraction constraints are applied to further improve the quality of the extracted phrases. The proposed constraints include a constraint based on the intersection of alignment points and constraints based on word similarity. For the latter, three metrics are studied and compared: the Dice coefficient, the phi-square coefficient and the log-likelihood ratio. Experimental results show that loose phrase extraction is an efficient method for extracting bilingual phrase pairs from n-best alignments, and that translation quality improves further when the above constraints are introduced. Compared with the conventional method, which extracts bilingual phrase pairs from the one-best alignment, loose phrase extraction over n-best alignments significantly improves translation quality.

3. The decoding problem of syntax-based statistical machine translation is studied. The reverse decoding method fails to make efficient use of the parse tree to guide the translation process, and this shortcoming motivates direct decoding. Two direct decoding algorithms are proposed, one based on beam search and one based on greedy search. Experimental results show that the direct decoding methods outperform the reverse decoding method, which indicates that direct decoding lets the structural information of the parse tree guide the translation process efficiently. By introducing syntactic structure into the phrase-based statistical model, a syntax-based reordering model is also presented, which helps to solve the problem of long-distance reordering.

4. An IE-based method for brief machine translation is presented to meet the needs of information browsing, given the current state of machine translation technology. First, information extraction is used to extract the key information of a sentence and drop its minor parts; then skip translation is performed on the extracted parts. The focus is on a hybrid strategy that combines the statistical approach and the rule-based approach. In this strategy, a language model is applied to select the proper translation from the candidate results generated by different translation models. Experimental results show that this method helps to generate clear translation results and avoid messy ones, with only a small loss of key information.
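The three word-similarity metrics named in item 2 can be illustrated with a minimal sketch. The abstract does not give the formulas or say how the scores are turned into extraction constraints, so the code below assumes the standard textbook definitions over a 2x2 sentence-level co-occurrence table. The function names and the toy corpus are hypothetical and chosen only for illustration.

```python
import math

def contingency(pairs, src_word, tgt_word):
    """Build a 2x2 co-occurrence table over aligned sentence pairs.

    a: both words present, b: only src_word, c: only tgt_word, d: neither.
    (Sentence-level counting is an assumption, not taken from the thesis.)
    """
    a = b = c = d = 0
    for src, tgt in pairs:
        has_src = src_word in src
        has_tgt = tgt_word in tgt
        if has_src and has_tgt:
            a += 1
        elif has_src:
            b += 1
        elif has_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d

def dice(a, b, c, d):
    """Dice coefficient: 2a / ((a + b) + (a + c))."""
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0

def phi_square(a, b, c, d):
    """Phi-square: (ad - bc)^2 divided by the product of the four marginals."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0

def log_likelihood_ratio(a, b, c, d):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

# Toy corpus (illustrative data only): tokenized source/target sentence pairs.
toy_pairs = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "house"], ["ein", "haus"]),
    (["a", "book"], ["ein", "buch"]),
]

table = contingency(toy_pairs, "house", "haus")
scores = (dice(*table), phi_square(*table), log_likelihood_ratio(*table))
```

On this toy corpus, "house" and "haus" always co-occur, so the Dice and phi-square scores both reach their maximum of 1.0. Phi-square and the log-likelihood ratio are symmetric in the sense that they also score strong negative association highly, so a constraint built on them would presumably need to check the direction of the association as well.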
Keywords/Search Tags: machine translation, statistical approach, bilingual phrase pair, direct decoding algorithm, information extraction