On Constituent Parsing With Multiple Data Sources

Posted on: 2014-05-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M H Zhu
Full Text: PDF
GTID: 1318330482954615
Subject: Computer software and theory
Abstract/Summary:
Constituent parsing (also known as phrase-structure parsing) is one of the core tasks of natural language processing. It often serves as a fundamental component in other tasks, such as machine translation and question answering. Since the release of human-labeled treebanks, data-driven approaches have become the mainstream of constituent parsing. Generally, more training data yields parsers with higher accuracy. Our work focuses on exploiting data from multiple sources to enlarge the training data of constituent parsers. We consider two types of data. The first is heterogeneous treebanks, which are treebanks constructed by different organizations following different annotation standards. Given the high cost of annotating parse trees by hand, it is of practical importance to make full use of treebanks that have already been created. The second is unlabeled data. In contrast to heterogeneous treebanks, unlabeled data is relatively easy to obtain and is generally available at large scale. Our contributions are summarized as follows.

We propose to apply an informed-decoding approach to treebank conversion. The approach is applied to convert POS tags and syntactic structures, respectively. We first build a POS tagger (syntactic parser) on the target treebank and then apply it to the sentences in a source treebank. During the decoding phase of the POS tagger (syntactic parser) on the source treebank, the original annotations in the source treebank are used to guide the decoding. The informed-decoding approach reaches a conversion accuracy of 84.2%.

We propose to apply a feature-based approach to treebank conversion. In contrast to the informed-decoding approach, the feature-based approach encodes annotations in a source treebank as features instead of hard constraints. We first build a POS tagger (syntactic parser) on the source treebank and then apply it to the sentences of the target treebank. After that, the sentences in the target treebank carry two types of annotations, on which we build a new POS tagger (syntactic parser) that is used to conduct the conversion. The feature-based approach improves the conversion accuracy to 84.8%.

We propose to do heterogeneous parsing through collaborative decoding. Compared to treebank conversion, heterogeneous parsing via collaborative decoding is a direct way to use heterogeneous treebanks. The idea of collaborative decoding is to build a parser on each individual treebank and then apply the resulting parsers to parse test sentences simultaneously. During the decoding phase, consensus information between the decoders is incorporated to encourage the parsers to agree in their parsing results. On the two experimental datasets, the co-decoding approach achieves improvements of 0.5% and 0.7%, respectively.

We study semi-supervised shift-reduce constituent parsing. The basic idea is to use an integrated parser to process unlabeled data and obtain a large set of auto-parsed trees. The POS data extracted from the auto-parsed trees is used as additional data to train stand-alone POS taggers, which can provide syntactic parsers with better POS tags. We also extract reliable partial information from the auto-parsed trees. Specifically, we use lexical dependency information, based on which we design a set of novel features. Combining the improved stand-alone POS taggers and the improved shift-reduce parsers, we advance shift-reduce parsing to the state of the art. The resulting parser reaches accuracies of 90.9% and 82.2% on English and Chinese, respectively.

Based on the technologies discussed in the thesis, we developed several state-of-the-art syntactic parsers, which have been deployed successfully in natural language processing systems such as machine translation and semantic role labeling.
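To make the informed-decoding idea concrete for the POS case, here is a minimal sketch. It is not the thesis's implementation: the abstract does not specify the tagging model or the form of the guidance, so the sketch assumes a toy bigram tagger with hand-set log scores and a hypothetical table `compatible` that maps each source-treebank tag to the target-treebank tags it permits. Decoding then searches only over permitted tags, so the source annotation acts as a hard constraint. All names and numbers below are illustrative.

```python
UNSEEN = -20.0  # log score assumed for any unlisted emission or transition


def viterbi(words, candidates, emit, trans, start="<s>"):
    """Best tag sequence where word i may only take a tag from candidates[i]."""
    # best[tag] = (score, path) of the best partial path ending in `tag`
    best = {start: (0.0, [])}
    for word, cands in zip(words, candidates):
        new_best = {}
        for tag in cands:
            e = emit.get((word, tag), UNSEEN)
            score, path = max(
                ((s + trans.get((prev, tag), UNSEEN) + e, p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def informed_candidates(source_tags, compatible, tagset):
    """Restrict each position to the target tags allowed by its source tag.

    Falls back to the full target tagset when a source tag has no entry.
    """
    return [sorted(compatible.get(tag) or tagset) for tag in source_tags]


# Toy target-side model (hypothetical scores, not learned from any treebank).
TAGSET = {"DT", "NN", "VB"}
EMIT = {("the", "DT"): -0.1, ("book", "NN"): -0.5,
        ("book", "VB"): -1.5, ("flights", "NN"): -0.3}
TRANS = {("<s>", "NN"): -1.0, ("<s>", "VB"): -1.0, ("<s>", "DT"): -1.0,
         ("NN", "DT"): -1.0, ("VB", "DT"): -1.0, ("DT", "NN"): -0.5}

# Hypothetical mapping from source-treebank tags to permitted target tags.
COMPATIBLE = {"v": {"VB"}, "d": {"DT"}, "n": {"NN"}}

if __name__ == "__main__":
    words = ["book", "the", "flights"]
    source_tags = ["v", "d", "n"]
    free = viterbi(words, [sorted(TAGSET)] * len(words), EMIT, TRANS)
    guided = viterbi(words, informed_candidates(source_tags, COMPATIBLE, TAGSET),
                     EMIT, TRANS)
    print("unconstrained:", free)
    print("informed:     ", guided)
```

In this toy setup the target-side model alone prefers the noun reading of "book", while the source annotation rules it out. The feature-based approach described above would instead pass the source tag to the model as an input feature and let training decide how much to trust it.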
Keywords/Search Tags: Constituent Parsing, Heterogeneous Treebanks, Treebank Conversion, Collaborative Decoding, Semi-Supervised Shift-Reduce