The Study On Sentence Segmentation For English-Chinese Machine Translation

Posted on:2017-03-29

Degree:Master

Type:Thesis

Country:China

Candidate:S L Yang

Full Text:PDF

GTID:2308330503458938

Subject:Computer Science and Technology

Abstract/Summary:

Machine translation has long been an important research direction in natural language processing, and its history can be traced back more than half a century. Despite this long period of development, translation quality is still unsatisfactory, especially for long and complex sentences: because of their length and complicated structure, they are translated much worse than sentences of ordinary length. Preprocessing long sentences before translation has therefore become an active research topic in recent years. This thesis concentrates on one such preprocessing step, splitting long sentences into several shorter sub-sentences, with the aim of improving the translation of long sentences.

First, we analyze English long sentences in detail, discuss their characteristics, and examine the problems they cause for machine translation systems. We then give a quantitative definition of what counts as a long sentence in this work, and compare how long sentences are used in Chinese and in English. On this basis, we propose segmenting long sentences by adding commas at appropriate positions.

Next, we propose a hybrid sentence splitting method consisting of two modules that address different problems. One module splits sentences by pattern matching over dependency structure; the other is based on sequence labeling with conditional random fields (CRF). The rule-based module mainly handles sentence components that are clear and easy to recognize with rules. Commas are added at the boundaries of these components to split long sentences.
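The rule-based idea described above can be illustrated with a small sketch. It is not the thesis's actual rule set: the `Token` structure, the trigger relations in `CLAUSE_RELATIONS`, and the hand-written parse in `EXAMPLE` are all assumptions made for illustration, and a real system would take its parse from a dependency parser. The sketch finds the token span covered by each clause-level dependent and proposes a comma at the span's boundaries.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int   # 0-based position in the sentence
    text: str
    head: int    # index of the head token, -1 for the root
    deprel: str  # dependency relation to the head


# Assumed trigger relations (hypothetical); the thesis's patterns are not given.
CLAUSE_RELATIONS = {"advcl", "ccomp", "parataxis"}


def subtree_span(tokens, root_index):
    """Return (first, last) token index covered by the subtree at root_index."""
    children = {}
    for tok in tokens:
        children.setdefault(tok.head, []).append(tok.index)
    stack, covered = [root_index], []
    while stack:
        node = stack.pop()
        covered.append(node)
        stack.extend(children.get(node, []))
    # Simplification: treats the subtree as one contiguous span.
    return min(covered), max(covered)


def rule_based_split_points(tokens):
    """Indices of tokens after which a comma is proposed."""
    points = set()
    last_index = len(tokens) - 1
    for tok in tokens:
        if tok.deprel in CLAUSE_RELATIONS:
            first, last = subtree_span(tokens, tok.index)
            if first > 0:
                points.add(first - 1)  # boundary before the clause
            if last < last_index:
                points.add(last)       # boundary after the clause
    return sorted(points)


def insert_commas(tokens, points):
    """Rebuild the sentence with a comma after each proposed position."""
    marked = set(points)
    return " ".join(t.text + "," if t.index in marked else t.text for t in tokens)


# Hand-written parse of a toy sentence (illustrative only).
EXAMPLE = [
    Token(0, "When", 3, "mark"),
    Token(1, "the", 2, "det"),
    Token(2, "rain", 3, "nsubj"),
    Token(3, "stopped", 5, "advcl"),
    Token(4, "we", 5, "nsubj"),
    Token(5, "went", -1, "root"),
    Token(6, "home", 5, "advmod"),
]

print(insert_commas(EXAMPLE, rule_based_split_points(EXAMPLE)))
# prints: When the rain stopped, we went home
```

Working on subtree spans keeps each rule local to one dependency relation, which is one plausible reading of "components that are easy to recognize by rules".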
The CRF-based module recasts sentence splitting as a sequence labeling task by assigning a special label to each word that precedes a comma. It builds a probability model of comma positions and computes the probability that a comma can be added at a given position. This module mainly handles sentence components that are difficult to recognize with rules.

The two modules are integrated with a parallel strategy: each long sentence is processed by both modules, their results are merged, and the sentence is split at the positions found by the two modules. In this way the modules cooperate and compensate for each other's weaknesses.

Finally, we designed two groups of experiments to test the method. One group verifies the effectiveness and rationality of the method, and the other measures its influence on machine translation results.
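The labeling scheme and the parallel merge can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's implementation: the label names `"C"` and `"O"`, the function names, and the choice to merge by set union are all hypothetical. The CRF model itself is omitted; the sketch only shows how comma-bearing text becomes labeled training data and how two sets of split positions might be combined.

```python
def to_labeled_sequence(tokens):
    """Turn a tokenized sentence containing commas into (word, label) pairs.

    Commas are dropped; the word before each comma is labeled "C",
    every other word is labeled "O" (label names are assumptions).
    """
    pairs = []
    for tok in tokens:
        if tok == ",":
            if pairs:
                word, _ = pairs[-1]
                pairs[-1] = (word, "C")
        else:
            pairs.append((tok, "O"))
    return pairs


def positions_from_labels(labels):
    """0-based indices of the words labeled "C"."""
    return {i for i, label in enumerate(labels) if label == "C"}


def merge_split_points(rule_points, crf_points):
    """Combine both modules' positions; union is an assumed merge rule."""
    return sorted(set(rule_points) | set(crf_points))


def apply_commas(words, points):
    """Insert a comma after each chosen word, never after the last word."""
    last = len(words) - 1
    chosen = set(points)
    return " ".join(
        w + "," if i in chosen and i != last else w for i, w in enumerate(words)
    )


tokens = "If it rains , we stay home , and we read".split()
pairs = to_labeled_sequence(tokens)
words = [w for w, _ in pairs]
labels = [label for _, label in pairs]

# Pretend the rule-based module found position 2 and the CRF module position 5.
merged = merge_split_points([2], [5])
print(apply_commas(words, merged))
# prints: If it rains, we stay home, and we read
```

With a trained sequence labeler, `labels` would come from the model's predictions on a comma-free sentence rather than from existing punctuation.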

Keywords/Search Tags:

Machine Translation, Long Sentence Segmentation, Preprocessing, Conditional Random Fields

Related items

1	Research On Deep Learning Based Bilingual Long Sentence Segmentation Method
2	The Research On Machine Translation From English To Chinese Of Long English Sentence
3	Research On Key Technologies Of Data Processing For Machine Translation
4	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
5	Design And Implementation Of Heuristic Analogy Translation Mechanism In IHSMTS
6	Study And Implementation On Key Techniques For Example Based Machine Translation
7	Chinese And Mongolian Lexical Analysis Research And Its Application In Statistical Machine Translation
8	The Application And Research Of Condition Random Fields And Maximum Entropy In Tag Mining
9	Research Of Chinese Phrase Identification Based On Conditional Random Fields
10	Research On The Key Problems In Tourism Text Oriented Chinese-English SMT