
Research On Semi-supervised Domain Adaptation For Chinese Dependency Parsing

Posted on: 2021-03-27    Degree: Master    Type: Thesis
Country: China    Candidate: X Peng
GTID: 2428330605474883    Subject: Computer technology
Abstract
In recent years, with the development of deep learning, dependency parsing has achieved great progress. However, parsing accuracy degrades dramatically when the evaluation data differs greatly from the training data in style, type, and genre, a problem known as domain adaptation. In particular, with the surge of user-generated content on the Internet, which usually differs from canonical newswire text, cross-domain parsing has become the major challenge in applying dependency parsing outputs to realistic NLP systems. Due to the lack of sufficient multi-domain labeled data, existing work on cross-domain dependency parsing mostly focuses on unsupervised methods, that is, methods that use no target-domain (the domain of the test data) labeled data for training. However, unsupervised domain adaptation is very difficult and has made little progress. This paper uses both source- and target-domain labeled data for training, and thus focuses on semi-supervised cross-domain dependency parsing. First, we construct a large-scale multi-domain Chinese Open Dependency Treebank (CODT). Then, based on CODT, we propose several semi-supervised domain adaptation methods and conduct extensive experiments for in-depth comparison and analysis. Finally, this paper discusses how to use large-scale unlabeled data to improve the performance of cross-domain dependency parsing.

(1) Propose a new Chinese dependency parsing annotation guideline and construct a Chinese dependency treebank

The construction of a dependency treebank requires an annotation guideline as its theoretical basis. Existing dependency parsing annotation guidelines mostly consider the annotation of canonical texts and may ignore syntactic phenomena that appear in non-canonical texts. Therefore, this paper first proposes a Chinese dependency parsing annotation guideline that covers diverse linguistic phenomena. To control annotation quality, this paper adopts a strict two-person annotation process. At the same time, we analyze the accuracy and consistency of the labeled data so that deficiencies in the annotation work can be remedied in time. In addition, this paper adopts a partial annotation method to reduce annotation cost: only the most difficult words in a sentence are selected for manual annotation. The new treebank, named the Chinese Open Dependency Treebank (CODT), contains about 130,000 sentences from 11 domains.

(2) Cross-domain dependency parsing based on the domain embedding method

Based on CODT, this paper focuses on the semi-supervised cross-domain dependency parsing scenario. The key to semi-supervised methods is how to fully extract features from the source- and target-domain training data. This paper proposes a novel domain embedding method: an extra domain embedding is added to each input word so that the model can learn both domain-general information and the differences between the two domains. This paper then applies the domain embedding method to the multi-source cross-domain dependency parsing task, where the training datasets of other target domains are also used as extra training data to improve parsing performance on one target domain. The experimental results show that 1) the domain embedding method is more effective than other baseline methods; and 2) if the differences between a source domain and the target domain are small, performance can be further improved by using that source domain as extra training data; otherwise, using it introduces noise into the model. In addition, since there is a great gap between the sizes of the source- and target-domain training datasets, this paper uses a corpus weighting strategy: in each iteration, the ratio of source-domain to target-domain training data is controlled to prevent the target domain from being overwhelmed by the source domain. Experiments show that the choice of ratio has a great impact on performance.
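
To make the domain embedding idea concrete, the following minimal PyTorch sketch shows one plausible realization of the two ingredients described above. It is an illustration under assumed names and dimensions, not the thesis's actual implementation; the DomainAwareEmbedding module and the corpus_weighted_sample helper are hypothetical.

    import random
    import torch
    import torch.nn as nn

    class DomainAwareEmbedding(nn.Module):
        # Concatenates a domain embedding to every word embedding, so the
        # encoder sees both domain-general and domain-specific signals.
        def __init__(self, vocab_size, word_dim, num_domains, domain_dim):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.domain_emb = nn.Embedding(num_domains, domain_dim)

        def forward(self, word_ids, domain_id):
            # word_ids: (batch, seq_len); domain_id: (batch,)
            words = self.word_emb(word_ids)                       # (batch, seq, word_dim)
            dom = self.domain_emb(domain_id)                      # (batch, domain_dim)
            dom = dom.unsqueeze(1).expand(-1, words.size(1), -1)  # repeat per token
            return torch.cat([words, dom], dim=-1)                # domain-aware input

    def corpus_weighted_sample(source_sents, target_sents, ratio=2.0):
        # Corpus weighting: cap the source data at a fixed multiple of the
        # target data per iteration, so that the small target-domain set
        # is not overwhelmed by the much larger source-domain set.
        k = min(int(len(target_sents) * ratio), len(source_sents))
        return random.sample(source_sents, k) + list(target_sents)

The two pieces are complementary: the sampler controls how much source-domain data enters each training iteration, while the embedding tells the encoder which domain each sentence came from.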
(3) Cross-domain dependency parsing based on language model fine-tuning

In the previous chapter, only labeled data is used to study semi-supervised domain adaptation methods. However, because data annotation is extremely complex and costly, how to use large-scale unlabeled data is also an important research direction in domain adaptation. In recent years, context-aware language models have developed rapidly and have helped many data-driven natural language processing tasks. In this paper, we extract features from large-scale unlabeled data by pre-training and fine-tuning context-sensitive language models (ELMo and BERT). The experimental results show that 1) general ELMo and BERT models help improve cross-domain dependency parsing performance; and 2) compared with the traditional self-training method for exploiting unlabeled data, using ELMo and BERT to extract features from large-scale unlabeled data is more effective.
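
As a rough illustration of the feature-extraction step, the sketch below pulls contextual vectors from a pre-trained Chinese BERT via the Hugging Face transformers library. The thesis does not specify this toolkit, the model name bert-base-chinese is an assumption, and the target-domain fine-tuning step discussed above is omitted.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # A general-purpose pre-trained Chinese BERT; in the setting above, the
    # language model would additionally be fine-tuned on target-domain text.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
    bert = AutoModel.from_pretrained("bert-base-chinese")

    def contextual_features(sentence):
        # Returns one contextual vector per subword token; a downstream
        # dependency parser would consume these vectors (pooled back to
        # word level) alongside ordinary word embeddings.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = bert(**inputs)
        return outputs.last_hidden_state.squeeze(0)  # (num_tokens, hidden_size)

    feats = contextual_features("互联网上的用户生成文本")
    print(feats.shape)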
Detailed analysis of the experimental results shows that the size of the target-domain training data has a great influence on dependency parsing performance. Concretely, we discuss how much target-domain training data is most suitable for domain adaptation, which may provide guidance for future data annotation and domain adaptation work. In addition, we organized a cross-domain dependency parsing shared task to make CODT available to more researchers. This paper reports the submitted results and summarizes the methods used by the participants.

In summary, this paper first constructs a high-quality Chinese dependency treebank. Then, based on this treebank, it studies semi-supervised methods for cross-domain dependency parsing. We hope that our preliminary results will contribute to the development of cross-domain dependency parsing.

Keywords/Search Tags: dependency parsing, Chinese dependency treebank, semi-supervised domain adaptation, domain embedding, language model, fine-tuning