Research On The Building Parallel Corpora For Machine Translation Based On Non-Parallel Data

Posted on: 2016-10-11; Degree: Master; Type: Thesis
Country: China; Candidate: M P Dong
GTID: 2308330503456370; Subject: Computer Science and Technology
Abstract/Summary:
As international communication increases, the language gap limits people’s communication more and more often. The purpose of machine translation is to build bridges between different languages. Parallel corpora play a critical role in statistical machine translation, but parallel corpora obtained from parallel sites are small in scale, narrow in domain coverage, and slow to update. By comparison, bilingual non-parallel data are large in scale, cover a wide range of domains, and update quickly. As a result, learning translation models from bilingual non-parallel data has attracted intensive attention from the community.

Bilingual non-parallel data can be divided into comparable corpora and non-parallel corpora. Comparable corpora still contain many parallel sentence pairs, so we can extract these sentences and add them to the original parallel corpora to expand their scale and coverage. Non-parallel corpora, in contrast, contain few parallel sentences because of data sparsity. In this thesis, we study machine translation models based on bilingual non-parallel data. The main contributions are as follows:

We propose to use a query lattice in translation retrieval to extract parallel sentences from comparable corpora. In this framework, exponentially many queries are represented as a query lattice. Compared with prior work, our approach runs much faster (from 0.75 to 0.13 seconds per sentence) and retrieves more accurately (from 83.76% to 93.16%). We also apply this method to extracting parallel sentences from comparable corpora for training machine translation systems. Experiments show that our method gains 2.6 BLEU points over the previous approach.

We propose an iterative approach to learning bilingual lexicons and phrases jointly from non-parallel corpora. Given two sets of monolingual data that might contain parallel phrases, we develop a generative model based on IBM Model 1, which treats the mapping between phrase pairs as a latent variable. The model is trained with the Viterbi EM algorithm. Experiments show that our method gains 2.1 BLEU points over the previous approach.
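
To make the second contribution more concrete, the following is a minimal sketch of Viterbi (hard) EM over an IBM Model 1 style lexicon, where the latent variable is which target phrase each source phrase maps to. It is an illustration under stated assumptions, not the thesis's actual model: the function names, the probability floor `FLOOR`, the `<null>` token, the seed lexicon, and the toy phrases are all hypothetical.

```python
import math
from collections import defaultdict

FLOOR = 1e-6          # assumed smoothing value for unseen word pairs
NULL = "<null>"       # assumed empty-word token, as in IBM Model 1


def model1_logprob(src, tgt, t):
    """Log P(src | tgt) under IBM Model 1, up to a length-dependent constant."""
    tgt_null = [NULL] + tgt
    total = 0.0
    for f in src:
        s = sum(t.get((f, e), FLOOR) for e in tgt_null)
        total += math.log(s / len(tgt_null))
    return total


def viterbi_em(src_phrases, tgt_phrases, seed_lexicon, iterations=5):
    """Alternate between hard phrase matching and lexicon re-estimation."""
    t = dict(seed_lexicon)
    matches = []
    for _ in range(iterations):
        # Hard E-step: pick the single best target phrase for each source phrase.
        matches = [
            max(range(len(tgt_phrases)),
                key=lambda j: model1_logprob(src, tgt_phrases[j], t))
            for src in src_phrases
        ]
        # M-step: collect Model 1 fractional counts from the matched pairs only.
        counts = defaultdict(float)
        totals = defaultdict(float)
        for src, j in zip(src_phrases, matches):
            tgt_null = [NULL] + tgt_phrases[j]
            for f in src:
                z = sum(t.get((f, e), FLOOR) for e in tgt_null)
                for e in tgt_null:
                    c = t.get((f, e), FLOOR) / z
                    counts[(f, e)] += c
                    totals[e] += c
        t = {(f, e): c / totals[e] for (f, e), c in counts.items()}
    return t, matches


# Toy, made-up data: two source phrases and three candidate target phrases.
src_phrases = [["la", "maison"], ["le", "chat"]]
tgt_phrases = [["the", "cat"], ["the", "house"], ["a", "dog"]]
seed = {("maison", "house"): 0.9, ("chat", "cat"): 0.9}

lexicon, matches = viterbi_em(src_phrases, tgt_phrases, seed)
print(matches)
```

The hard E-step commits to one mapping per source phrase instead of summing over all mappings, which is what distinguishes Viterbi EM from standard EM. A small seed lexicon is assumed here so that the first matching step has something to work from.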
Keywords/Search Tags:non-parallel, translation retrieval, lattice, EM