Research On Chinese Word Segmentation Strategies For Statistical Machine Translation

Posted on: 2014-08-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Xi
GTID: 1108330482951897
Subject: Computer application technology
Abstract/Summary:
As society moves into the information era, the demand for translation between languages grows day by day, and conventional human translation is far from sufficient to meet it. In this context, machine translation, especially statistical machine translation (SMT), has gained popularity because it can learn automatically and generate acceptable translations.
Linguistically, a "word" is the smallest unit that can be used independently. Unlike many Western languages such as English, the Chinese writing system has no delimiters between characters to mark word boundaries. Splitting a Chinese sentence into a sequence of words, known as word segmentation, has therefore become an important preprocessing step in many Chinese natural language processing (NLP) applications, including machine translation involving Chinese. Theoretical linguists have made efforts to define the Chinese word, but they have not reached full agreement. It is generally accepted that different NLP applications benefit from different segmentation granularities, and that better segmentation performance from a monolingual point of view does not necessarily yield better translation performance from a bilingual point of view. It is therefore necessary to reconsider word segmentation strategies for machine translation.
The impact of word segmentation on machine translation is complex and shows in two aspects:
1. From an overall perspective, the steps of the SMT pipeline have different properties, which suggests that different steps may need different word segmentations. However, when optimizing word segmentation for SMT, previous work has neglected this overall impact by assuming that all steps in the pipeline should use the same word segmentation.
Such a practice may lead to a sub-optimal choice of word segmentation for SMT and thus hurt translation performance.
2. From a local perspective, each step of SMT faces the problem of choosing the segmentation granularity: the coarser the granularity, the more context can be captured, but also the more likely data sparseness becomes. When optimizing word segmentation for a step, many previous works used only a single word segmentation, which risks losing segmentation granularities that are beneficial for that step.
In view of the complex impact of word segmentation on SMT and the shortcomings of previous work, this thesis proposes a framework for combining word segmentations in Chinese-English (English-Chinese) SMT, in order to make full use of the diverse and complementary knowledge implied in multiple word segmentations.
1. In view of the overall impact of word segmentation on SMT, the thesis proposes a serial combination strategy: use different word segmentations in different steps. The serial combination strategy alleviates the sub-optimization problem by searching for a better combination of word segmentations for SMT.
2. In view of the local impact of word segmentation on SMT, the thesis proposes a parallel combination strategy: integrate multiple word segmentations in each step. The diverse and complementary knowledge implied in multiple word segmentations is employed to improve the performance of that step.
In the word alignment step, a heuristic approach is first proposed to combine word alignments based on multiple word segmentations. To address the limitations of the heuristic approach in modeling, searching, and training, a discriminative word alignment method based on multiple word segmentations is then proposed, in which modeling, searching, and training are formally defined in the context of multiple word segmentations. To improve the reordering capability of the language model, language models based on multiple word segmentations are integrated in the decoding step of English-Chinese SMT. Overall, the parallel combination strategy lowers the risk of losing segmentation granularities that are beneficial for a given step.
3. Previous work used only bilingual word alignment to learn SMT-motivated word segmentation. To address this limitation, the thesis proposes a word segmenter that integrates monolingual and bilingual knowledge. Based on sequence labeling models, this approach uses not only bilingual word alignment but also the output of a monolingual word segmenter to learn an independent, SMT-motivated word segmenter. This approach alleviates the limitation of previous work on learning SMT-motivated word segmentation.
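The idea of combining word alignments obtained from multiple word segmentations can be illustrated with a minimal Python sketch. It is not the heuristic proposed in the thesis, which the abstract does not specify. It assumes a simple scheme in which each word-level alignment is projected onto characters, so that alignments built on different segmentations share one index space and can be intersected or united. The function names (`char_spans`, `project_to_chars`, `combine`), the example phrase, and the hand-written alignments are all hypothetical.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source index, target word index)


def char_spans(words: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each word in the sentence."""
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word)
    return spans


def project_to_chars(words: List[str], alignment: Set[Link]) -> Set[Link]:
    """Turn word-level links into character-level links.

    Every character of an aligned source word inherits the word's links.
    """
    spans = char_spans(words)
    links: Set[Link] = set()
    for src_word, tgt_word in alignment:
        start, end = spans[src_word]
        for char_idx in range(start, end):
            links.add((char_idx, tgt_word))
    return links


def combine(segmentations: List[List[str]],
            alignments: List[Set[Link]],
            mode: str = "intersection") -> Set[Link]:
    """Combine alignments from several segmentations of the same sentence."""
    sentences = {"".join(words) for words in segmentations}
    if len(sentences) != 1:
        raise ValueError("all segmentations must cover the same sentence")
    projected = [project_to_chars(w, a) for w, a in zip(segmentations, alignments)]
    if mode == "intersection":   # keep links that every segmentation supports
        return set.intersection(*projected)
    if mode == "union":          # keep links that any segmentation supports
        return set.union(*projected)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical example: two segmentations of the same Chinese phrase,
# aligned by hand to the English words ["machine", "translation"].
coarse = ["机器翻译"]          # one coarse-grained word
fine = ["机器", "翻译"]        # two fine-grained words
coarse_alignment = {(0, 0), (0, 1)}
fine_alignment = {(0, 0), (1, 1)}

high_precision = combine([coarse, fine], [coarse_alignment, fine_alignment])
high_recall = combine([coarse, fine], [coarse_alignment, fine_alignment], mode="union")
print(sorted(high_precision))
print(len(high_recall))
```

Under these assumptions, intersection keeps only the links that every segmentation agrees on, while union keeps every link that any segmentation proposes. Working at the character level is only one possible design choice for making alignments over different segmentations comparable.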
Keywords/Search Tags:machine translation, Chinese word segmentation, combination, word alignment, language model