The Study on Sentence Segmentation for English-Chinese Machine Translation

Posted on: 2017-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: S L Yang
Full Text: PDF
GTID: 2308330503458938
Subject: Computer Science and Technology
Abstract/Summary:
Machine translation has long been an important research direction in natural language processing, and its history can be traced back more than half a century. Despite this long period of development, the quality of machine translation is still unsatisfactory, especially for long and complex sentences. Because of their length and complicated structure, such sentences are translated much worse than ordinary sentences. Preprocessing long sentences before translation has therefore become a popular research topic in recent years. This thesis concentrates on one form of preprocessing: splitting long sentences into several shorter sub-sentences in order to improve their translation.
First, this thesis analyzes English long sentences in detail, discusses their characteristics, and examines the problems they cause for machine translation systems. We then give a quantitative definition of a long sentence for the purposes of this work, and finally compare how long sentences are used in Chinese and in English. On this basis, the thesis proposes segmenting long sentences by inserting commas at appropriate positions.
We then propose a hybrid sentence-splitting method consisting of two modules that handle different problems. One module splits sentences by pattern matching over the dependency structure; the other uses sequence labeling with conditional random fields. The rule-based module mainly handles sentence components that are clear and easy to recognize with rules; commas are added at the boundaries of these components to split the long sentence.
The module based on conditional random fields (CRF) recasts sentence splitting as a sequence labeling task by assigning a special label to each word that precedes a comma. It builds a probability model of comma positions and computes the probability that a comma can be added at a given position. This module mainly handles sentence components that are difficult to recognize with rules.
The two modules are integrated with a parallel strategy: each long sentence is processed by both modules, the two sets of results are merged, and the sentence is split at the positions found by the two modules. In this way the modules cooperate and compensate for each other's weaknesses.
In the last part of the thesis, we design two groups of experiments to test the method. One group verifies the effectiveness and soundness of the method itself, and the other measures its influence on machine translation results.
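The labeling scheme and the parallel merge described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the label names "C" and "O", the function names, the convention that a position is the index of the word after which a comma is added, and the choice to merge by set union are all assumptions made for illustration. The positions passed in stand in for the outputs of the rule-based and CRF modules, which are not reproduced here.

```python
def commas_to_labels(tokens):
    """Convert a tokenized sentence with commas into (words, labels).

    Assumed scheme: the word directly before a comma is labeled "C",
    every other word is labeled "O", and the commas themselves are dropped.
    """
    words, labels = [], []
    for tok in tokens:
        if tok == ",":
            if labels:              # ignore a comma with no preceding word
                labels[-1] = "C"
        else:
            words.append(tok)
            labels.append("O")
    return words, labels


def merge_positions(rule_positions, crf_positions):
    """Parallel strategy (assumed to be a union): keep every position
    proposed by either module, without duplicates, in sentence order."""
    return sorted(set(rule_positions) | set(crf_positions))


def insert_commas(words, positions):
    """Insert a comma after each word whose index is in `positions`.
    A position on the final word is skipped so no trailing comma is added."""
    wanted = set(positions)
    last = len(words) - 1
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in wanted and i != last:
            out.append(",")
    return out


# Training-side view: a sentence with commas becomes a labeled sequence.
words, labels = commas_to_labels("When it rains , we stay home".split())

# Prediction-side view: hypothetical outputs of the two modules are merged.
merged = merge_positions([2], [2, 4])
split_tokens = insert_commas(words, merged)
```

Under these assumptions, a CRF trained on such label sequences would predict the "C" positions for unseen sentences, and the merged positions from both modules determine where the long sentence is split.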
Keywords/Search Tags: Machine Translation, Long Sentence Segmentation, Preprocessing, Conditional Random Fields