Research On Some Key Technologies Of Tibetan Machine Translation Based On Tree To String

Posted on: 2015-12-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q C R Hua
GTID: 1488304322962729
Subject: Computer software and theory
Abstract/Summary:
Statistical machine translation (SMT) has become the most popular approach in machine translation in recent years. Research has progressed from word-based models through phrase-based and syntax-based models, and is now evolving toward models that exploit semantic knowledge. SMT has achieved remarkable results for English and Chinese, but research on Tibetan syntactic translation models is still in its initial stage. This is partly because Tibetan information processing started relatively late, and partly because the basic key technologies for Tibetan syntactic translation are not yet fully resolved.

Syntactic translation models are based on syntax trees and draw on the syntactic and semantic knowledge that the trees contain. Their prerequisites are relatively mature technologies for lexical analysis, syntactic analysis, and automatic extraction of translation rules from syntax trees. A dependency structure holds both syntactic and semantic knowledge and can be viewed as a transition from syntactic to semantic representation, which helps improve the quality of statistical machine translation. For this reason, this thesis focuses on these key technologies for a Tibetan dependency tree-to-string SMT model, aiming to develop a machine translation system that takes Tibetan dependency trees as its source side.

Specifically, the research contents and results of the thesis are summarized in four parts, as follows:

1. A Tibetan lexical analysis system that includes word segmentation and POS tagging is implemented. Considering the practicality of Tibetan lexical analysis, this thesis adopts a strategy of word segmentation first, followed by POS tagging. For word segmentation, it proposes a perceptron-based Tibetan word classification and lattice re-ranking method, and a new rule-based Tibetan syllable segmentation method.
A discriminative model over syllable features produces a coarse word segmentation and generates a word lattice; the shortest path is then computed with edge weights penalized according to dictionary lookup, which yields the optimal segmentation. This method captures both local features inside words and non-local features between words. For POS tagging, the thesis proposes a perceptron-based discriminative Tibetan POS tagging technique. Feature templates for model training are designed from Tibetan lexical features, and averaged perceptron weights are trained. Finally, a beam search decoding algorithm assigns POS tags to the word-segmented sentence. Experiments show that the Tibetan lexical analysis system has reached a practical level; it has been applied to the Tibetan-Chinese machine translation evaluation in CWMT and to syntactic analysis.

2. There has been no practical Tibetan dependency parser, dependency annotation standard, or treebank. This thesis first defines 36 Tibetan dependency annotation classes. Second, to address the existing problems in building a Tibetan dependency treebank, it proposes a semi-automatic treebank construction method based on Tibetan word dependency classification models, including a word-pair dependency classification model and a dependency edge labeling model. Semi-automatic syntax tree annotation software was developed, with carefully designed feature templates and models trained using maximum entropy. Using this semi-automatic dependency annotation tool, a Tibetan dependency treebank, TDTreebank1.1, containing 11 thousand sentences was constructed and proofread. Third, an online averaged perceptron training algorithm and a maximum spanning tree based decoding algorithm are implemented. Experiments show that the Tibetan dependency parser has almost reached a usable level.

3. A translation rule acquisition algorithm is implemented for the Tibetan dependency tree-to-string model. Following the dependency tree control criterion, the Tibetan dependency tree is decomposed into head-dependent relation (HDR) fragments.
Each HDR fragment is guaranteed to share an overlapping node with other HDR fragments, so simple substitution serves as the basic operation for generating a dependency tree-to-string translation. The rule extraction algorithm proceeds in three steps: tree labeling, acceptable HDR fragment recognition, and rule generation. To improve the discriminative ability of translation rules, the open and closed POS classes of Tibetan words are used to constrain rules when they are generalized. For head node rules, a Tibetan basic numeral translation model is presented. Experiments show that the POS constraint and basic numeral translation help improve the performance of the dependency tree-to-string model.

4. A Tibetan syntactic translation decoding algorithm is implemented; the decoder is based on a bottom-up chart parsing algorithm. Because subtree-consistent spans are used as the constraint for identifying acceptable HDR fragments on the bilingual corpus, a reordering model is no longer needed. For lexicalized rules and the various generalized rules, a scheme that matches all rules completely is chosen, together with the cube pruning algorithm. Experiments on a small-scale Tibetan-Chinese bilingual corpus show that the Tibetan dependency tree-to-string model performs well. This is the first SMT system that realizes a Tibetan syntactic translation model.
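
The lattice step in part 1 (shortest path over a word lattice, with edge weights penalized by dictionary lookup) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `segment`, the `oov_penalty` parameter, the additive form of the penalty, and the placeholder syllables are all assumptions made for illustration. In the thesis, the base edge costs would come from the discriminative syllable-feature model.

```python
from typing import Dict, List, Set, Tuple

Word = Tuple[str, ...]  # a word is a tuple of syllables


def segment(
    syllables: List[str],
    edge_costs: Dict[Tuple[int, int], float],
    dictionary: Set[Word],
    oov_penalty: float = 2.0,
) -> List[Word]:
    """Return the lowest-cost path through a word lattice.

    edge_costs maps a span (start, end) over syllables[start:end] to a base
    cost (hypothetically, from a coarse discriminative segmenter). Edges
    whose word is missing from the dictionary pay an extra oov_penalty.
    """
    n = len(syllables)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: cheapest cost to cover syllables[:i]
    back = [-1] * (n + 1)    # back[i]: start index of the last word on that path
    best[0] = 0.0

    # The lattice is a DAG ordered by position, so one left-to-right pass suffices.
    for end in range(1, n + 1):
        for start in range(end):
            if (start, end) not in edge_costs or best[start] == inf:
                continue
            word = tuple(syllables[start:end])
            cost = edge_costs[(start, end)]
            if word not in dictionary:
                cost += oov_penalty  # assumed additive dictionary penalty
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start

    if best[n] == inf:
        raise ValueError("lattice has no path covering the whole sentence")

    # Follow back-pointers from the end to recover the word sequence.
    words: List[Word] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(tuple(syllables[start:end]))
        end = start
    return words[::-1]


if __name__ == "__main__":
    # Placeholder syllables; real input would be Tibetan syllables.
    syls = ["a", "b", "c", "d"]
    lattice = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.5,
               (2, 3): 1.0, (3, 4): 1.0, (2, 4): 1.5}
    lexicon = {("a", "b"), ("c",), ("d",)}
    print(segment(syls, lattice, lexicon))
```

Because lattice edges only point forward, a single dynamic-programming pass finds the shortest path without a general graph search. Changing `oov_penalty` shifts the balance between trusting the model's base costs and trusting the dictionary.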
Keywords/Search Tags: Tibetan lexical analysis, Tibetan dependency parsing, Tibetan Treebank, syntactic translation model, Tibetan dependency tree-to-string machine translation