Chinese-English Parallel Phrase Dependency Treebank: Construction and Application

Posted on: 2014-01-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J X Cao
GTID: 1268330425477355
Subject: Computer application technology
Abstract/Summary:
Parallel corpora are valuable resources for Natural Language Processing (NLP) and Machine Translation (MT). Most parallel corpora are currently aligned at the sentence level, and only a few at the word or phrase level. Web data mining has made parallel corpora larger than ever before, and statistical machine translation has made significant progress as a result. But the growing scale of parallel texts cannot resolve every difficulty in NLP and MT. Many complex linguistic phenomena and translation problems still require corpora with rich linguistic annotation to improve the precision of analysis and translation. Parallel Aligned Treebanks (PATs) are one resource of this kind.

This paper proposes a hybrid Phrase Dependency Grammar (PDG), which adopts the verb-headed case grammar of Dependency Grammar (DG) instead of the binary branching of Phrase Structure Grammar (PSG) for the clause, keeps the constituency nodes of PSG, and abandons the technical mono-headed binary branching of DP. An annotation scheme for a PDG-based Treebank (PDT) is designed, and a Chinese-English Parallel PDT (DUT-CEPDT) is constructed. The DUT-CEPDT is a node-aligned parallel PDT in which only the semantic Head is recognized, while every node is annotated with its constituency type (CT) and its dependency types (DTs). The DTs are assigned in two separate layers: Syntactic-Function types, fixed once and for all in a closed tagset, and semantic roles, assigned dynamically from an open tagset. The bracketing and the annotation of CTs and DTs are carried out with the LingTreeConstructor, and node alignment is performed by linking the IDs of translation-equivalent nodes in a specially designed editor. The results of the inter-annotator agreement experiments show that the scheme is workable.

DUT-CEPDT so far covers two Chinese government work reports and 100 United Nations Resolutions, comprising 169,360 Chinese characters and 128,283 English words. It is designed primarily for applications in NLP and MT, particularly parser training and the extraction of translation lexicons and transfer rules. Extended applications include the development of grammar teaching aids and various linguistic research projects.
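To make the annotation layers described above concrete, the following is a minimal sketch of how a node-aligned parallel tree pair might be represented and how aligned node pairs could be read off as candidate translation-lexicon entries. It is not the DUT-CEPDT file format or the LingTreeConstructor data model, neither of which is specified in the abstract. The class `Node`, the function `aligned_pairs`, the node IDs, the tag values ("CL", "NP", "VP", "SBJ", "PRD", "OBJ", "Agent", "Patient"), and the example sentence are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One treebank node (hypothetical layout, not the DUT-CEPDT format)."""
    node_id: str                        # ID used to link translation nodes
    ct: str                             # constituency type (CT)
    syn_function: Optional[str] = None  # DT layer 1: closed tagset
    sem_role: Optional[str] = None      # DT layer 2: open tagset
    is_head: bool = False               # marks the semantic Head
    word: Optional[str] = None          # surface form, leaves only
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self, sep: str = " ") -> str:
        """Surface string covered by this node."""
        if self.word is not None:
            return self.word
        return sep.join(child.text(sep) for child in self.children)


def aligned_pairs(zh_root: Node, en_root: Node,
                  links: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn ID-to-ID node links into (Chinese, English) string pairs."""
    zh_index = {n.node_id: n for n in zh_root.walk()}
    en_index = {n.node_id: n for n in en_root.walk()}
    # Chinese is joined without spaces, English with single spaces.
    return [(zh_index[z].text(""), en_index[e].text(" ")) for z, e in links]


# Illustrative sentence pair; all tags are placeholder values.
zh = Node("z0", "CL", children=[
    Node("z1", "NP", "SBJ", "Agent", word="我们"),
    Node("z2", "VP", "PRD", is_head=True, word="支持"),
    Node("z3", "NP", "OBJ", "Patient", word="改革"),
])
en = Node("e0", "CL", children=[
    Node("e1", "NP", "SBJ", "Agent", word="We"),
    Node("e2", "VP", "PRD", is_head=True, word="support"),
    Node("e3", "NP", "OBJ", "Patient", word="reform"),
])
links = [("z0", "e0"), ("z1", "e1"), ("z2", "e2"), ("z3", "e3")]
pairs = aligned_pairs(zh, en, links)
```

In this sketch the clause has a flat, verb-headed structure rather than a binary-branching one, and the alignment is stored as pairs of node IDs, mirroring the description of node alignment as a connection between translation node IDs. Filtering `pairs` by constituency type or dependency type would be one plausible starting point for extracting lexicon entries or transfer rules.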
Keywords: Phrase Dependency Grammar, Chinese-English Parallel Treebank, Node Alignment, Natural Language Processing, Machine Translation