
Chinese Syntax Parsing And Its Application To Chinese-English Statistical Machine Translation

Posted on: 2008-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
GTID: 2178360242978545
Subject: Computer applications
Abstract/Summary:
In this paper, we survey two of the most important research areas in natural language processing: PCFG parsing and machine translation. Based on an analysis of milestone techniques, we design and implement a Chinese PCFG parser and a preliminary tree-to-tree statistical machine translation system.

For PCFG parsing, we integrate POS tagging and parsing into one process to combat the error-snowball effect. We separate parsing into two processes: initialization and extension. In the initialization process, we treat each possible POS tag of each word as a candidate edge and assign a priority and a probability to each of these edges based on the HMM hypothesis. In the extension process, we adopt the edge-based best-first parsing algorithm proposed by Charniak to select the next edge by highest priority, and Collins' head-driven Markovization hypothesis to generate new edges from the selected edge. For edges other than POS edges, the priority is calculated by guessing the next word. To calculate the probability of newly generated edges, we adopt Charniak's feature functions and MaxEnt-inspired method, except that we use EM to estimate the lambdas for each sub-function. Our parser achieves an F-measure of 80.36% on Penn Chinese Treebank 1.0, which is comparable to state-of-the-art Chinese PCFG parsers.

For machine translation, we propose a variation of the tree-to-tree model. We extract rules from parallel parsed treebanks. Rule generation proceeds bottom-up, and we define aligned subtree pairs based on word alignment. Once we find an aligned subtree pair, we first extract a rule and then condense the aligned subtree to its root node, so that the subtree becomes a new leaf for later processing; we call this new kind of node a "generalized leaf". For rule generation, we simplify the rule representation from the whole tree to its root node and generalized leaves. We implemented a preliminary system based on this model; the results show that the model is much more compact than other state-of-the-art SMT systems and shows promising potential.
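The bottom-up rule extraction step can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of an aligned subtree pair, so the sketch assumes the consistency criterion commonly used in phrase extraction: no word-alignment link may connect a word inside one span to a word outside the other. The names `is_aligned_pair` and `condense`, the span representation, and the toy sentence pair are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)
Span = Tuple[int, int]  # inclusive (start, end) word indices of a subtree's yield


def is_aligned_pair(src: Span, tgt: Span, links: Set[Link]) -> bool:
    """Assumed criterion: every link touching either span lies inside both,
    and at least one link lies inside the pair."""
    has_inner_link = False
    for s, t in links:
        s_in = src[0] <= s <= src[1]
        t_in = tgt[0] <= t <= tgt[1]
        if s_in != t_in:
            return False  # a link crosses the boundary of the pair
        if s_in:
            has_inner_link = True
    return has_inner_link


def condense(words: List[str], extracted: Dict[Span, str]) -> List[str]:
    """Replace each already-extracted span with its root label,
    so that it acts as a single generalized leaf."""
    by_start = {start: (end, label) for (start, end), label in extracted.items()}
    out: List[str] = []
    i = 0
    while i < len(words):
        if i in by_start:
            end, label = by_start[i]
            out.append(label)
            i = end + 1
        else:
            out.append(words[i])
            i += 1
    return out


# Toy example (hypothetical data): a monotone one-to-one alignment.
source = ["wo", "xihuan", "hong", "pingguo"]
target = ["I", "like", "red", "apples"]
links = {(0, 0), (1, 1), (2, 2), (3, 3)}

if is_aligned_pair((2, 3), (2, 3), links):
    # The NP subtree pair is aligned, so a rule is extracted for it and
    # both sides are condensed before larger subtrees are considered.
    source_after = condense(source, {(2, 3): "NP"})
    target_after = condense(target, {(2, 3): "NP"})
```

After condensation, a rule for a larger subtree only needs to mention its root node and the sequence of words and generalized leaves beneath it, which is one way to read the compactness described in the abstract.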
Keywords/Search Tags: NLP, PCFG Parsing, SMT