Research On Chinese Word Segmentation Strategies For Statistical Machine Translation

Posted on:2014-08-21

Degree:Doctor

Type:Dissertation

Country:China

Candidate:N Xi

GTID:1108330482951897

Subject:Computer application technology

Abstract/Summary:

As society moves into the information era, the demand for translation between languages increases day by day, and conventional human translation is far from sufficient to meet it. In this context, machine translation, especially statistical machine translation (SMT), has gained popularity because it can learn automatically and generate acceptable translation output.

Linguistically, a "word" is the minimum unit that can be used independently. Unlike many Western languages such as English, the Chinese writing system has no delimiters between characters to mark word boundaries. Splitting a Chinese sentence into a sequence of words, known as word segmentation, has therefore become an important preprocessing step in many Chinese-related natural language processing (NLP) applications, including machine translation. Theoretical linguists have made efforts to define the Chinese word, but they have not reached full agreement. It is accepted that different NLP applications benefit from different segmentation granularities, and that better segmentation performance from a monolingual point of view does not necessarily yield better translation performance from a bilingual point of view. It is therefore necessary to reconsider word segmentation strategies for machine translation.

The impact of word segmentation on machine translation is complex and shows in two aspects:

1. From an overall perspective, the steps in the SMT pipeline have different properties, which suggests that different steps may need different word segmentations. However, when optimizing word segmentation for SMT, previous work has neglected this overall impact by assuming that all steps in the pipeline should use the same word segmentation. This practice may lead to a sub-optimal choice of word segmentation for SMT and thus hurt translation performance.

2. From a local perspective, each step of SMT faces the problem of choosing a segmentation granularity: the coarser the granularity, the more context can be captured, but the more likely data sparseness becomes. When optimizing word segmentation in each step, much previous work used only a single word segmentation, which risks losing segmentation granularities that are beneficial for that step.

In view of the complex impact of word segmentation on SMT and the shortcomings of previous work, this thesis proposes a framework for combining word segmentations in Chinese-English and English-Chinese SMT, in order to make full use of the diverse and complementary knowledge implied in multiple word segmentations.

1. In view of the overall impact of word segmentation on SMT, the thesis proposes a serial combination strategy: use different word segmentations in different steps. The serial combination strategy alleviates the sub-optimization problem by searching for a better combination of word segmentations for SMT.

2. In view of the local impact of word segmentation on SMT, the thesis proposes a parallel combination strategy: integrate multiple word segmentations in each step. The diverse and complementary knowledge implied in multiple word segmentations is used to improve the performance of that step. In the word alignment step, a heuristic approach is first proposed to combine word alignments based on multiple word segmentations. To overcome the limitations of the heuristic approach in modeling, searching, and training, a discriminative word alignment model based on multiple word segmentations is then proposed, and modeling, searching, and training are formally defined in the context of multiple word segmentations. To improve the reordering capability of the language model, language models based on multiple word segmentations are integrated in the decoding step of English-Chinese SMT. Overall, the parallel combination strategy lowers the risk of losing segmentation granularities that are beneficial for a given step.

3. Previous work used only bilingual word alignment to learn SMT-motivated word segmentation. To overcome this limitation, the thesis proposes a word segmenter that integrates monolingual and bilingual knowledge. Based on sequence labeling models, this approach uses not only bilingual word alignment but also the segmentation results of a monolingual word segmenter to learn an independent, SMT-motivated word segmenter.
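The abstract does not spell out the heuristic used to combine word alignments, so the following Python sketch is only an illustration of one plausible way to make alignments from different segmentations comparable, not the thesis's actual method. It assumes that each word-level link can be projected onto the Chinese characters the word covers, after which links from a coarse and a fine segmentation can be intersected or unioned. The example sentence, both segmentations, the alignment links, and the helper names `char_spans` and `to_char_links` are all hypothetical.

```python
from typing import List, Set, Tuple

def char_spans(words: List[str]) -> List[range]:
    """Return, for each word, the range of character positions it covers."""
    spans, start = [], 0
    for w in words:
        spans.append(range(start, start + len(w)))
        start += len(w)
    return spans

def to_char_links(words: List[str],
                  links: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Project (zh_word_idx, en_word_idx) links to (zh_char_idx, en_word_idx)."""
    spans = char_spans(words)
    return {(c, e) for (z, e) in links for c in spans[z]}

# Hypothetical sentence with a coarse and a fine segmentation.
sentence = "机器翻译系统"
coarse = ["机器翻译", "系统"]
fine = ["机器", "翻译", "系统"]
english = ["machine", "translation", "system"]

# Hypothetical word alignments, one per segmentation.
coarse_links = {(0, 0), (0, 1), (1, 2)}
fine_links = {(0, 0), (1, 1), (2, 2)}

# After projection, both alignments live in the same character-level space.
coarse_chars = to_char_links(coarse, coarse_links)
fine_chars = to_char_links(fine, fine_links)

intersection = coarse_chars & fine_chars  # links both segmentations agree on
union = coarse_chars | fine_chars         # links proposed by either one

print(sorted(intersection))
print(sorted(union))
```

In this toy case the fine-grained links are a subset of the coarse-grained ones, so the intersection keeps the more precise links while the union keeps the broader coverage. A real combination would also need to decide how to map character-level links back to the segmentation used by later pipeline steps.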

Keywords/Search Tags:

machine translation, Chinese word segmentation, combination, word alignment, language model

Related items

1	Research Of Some Key Issues In Highly Adaptive Example-Based Machine Translation
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	Research On Improving The Performance Of Chinese-Uyghur Word Alignment For Statistical Machine Translation
4	Research On Chinese Word Segmentation Based On Machine Translation Technology
5	Low-Resource Machine Translation Techniques For Distant Language Pair
6	Research On Chinese-Uyghur Word-Alignment For Statistical Machine Translation
7	Research And Application Of Key Technologies Of Chinese-English Parallel Corpus
8	The Application Research Of Word Sense Disambiguation In The Statistical Machine Translation
9	Study On Word Alignment For Re-ordering Of Web-mined OOV Translation Candidates
10	Research On English And Chinese Interactive Machine Translation Technique Base On Word Prediction