Conversion And Exploitation Of Heterogeneous Annotations Based On Neural Coupled Sequence Labeling

Posted on: 2021-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: D P Huang
GTID: 2428330605974880
Subject: Software engineering major
Abstract/Summary:
Supervised statistical machine learning methods rely on high-quality manual annotations to learn model parameters. However, manual annotation costs considerable human effort and time. In natural language processing (NLP), there often exist multiple labeled corpora for the same task that follow different annotation guidelines; such corpora are called multi-source heterogeneous data. Compared with a single annotated corpus, multi-source heterogeneous annotations are superior in both scale and genre coverage, and can thus alleviate data sparseness during model training. This thesis takes two lexical analysis tasks as case studies, namely word segmentation (WS) and part-of-speech (POS) tagging. It proposes a neural coupled sequence labeling method that uses multi-source heterogeneous annotations directly. On the one hand, the method performs annotation conversion between heterogeneous tag sets effectively (i.e., homogenization); on the other hand, it improves segmentation and tagging accuracy. The main research contents are as follows.

(1) Construction of a crowdsourcing annotation system and annotation of lexical data

Currently, almost all lexical annotations follow a single annotation guideline. To evaluate annotation conversion accuracy between heterogeneous data, it is necessary to manually annotate same-source data of a certain scale with heterogeneous annotations (i.e., each sentence carries manual annotations under different annotation guidelines simultaneously). For this purpose, we randomly select 1,000 sentences from the People's Daily Corpus (PD) and manually annotate them with POS tags following the Chinese Penn Treebank (CTB) guideline. To guarantee the quality of the manual annotations, we design an annotation workflow and develop a browser-based crowdsourcing annotation system. In addition to POS annotation, the system also supports many other NLP tasks, such as multi-class classification, hierarchical classification, WS, named entity recognition (NER), and dependency parsing.

(2) Neural coupled sequence labeling model for heterogeneous lexical data

To train a model directly on multi-source heterogeneous data, Li et al. (2015) proposed a coupled sequence labeling model based on traditional discrete features that learns and infers two heterogeneous tags simultaneously. The basic idea is to bundle two POS tags together (e.g., "NN@n") to form a coupled POS tag, and to train the model in the coupled tag space based on the idea of ambiguous labeling. This thesis extends the discrete-feature coupled sequence labeling model to a neural network framework. We employ a multilayer BiLSTM as the encoder. For scoring, we employ three MLPs: two predict the scores of the two separate tag sets and one predicts the scores of the joint (bundled) tags. The three scores are summed, according to the mapping between separate and bundled tags, to give the final score. Experiments show that, compared with a baseline model trained on a single dataset, the neural coupled sequence labeling model achieves significant accuracy improvements on both POS tagging and joint WS and POS tagging. Compared with a multi-task learning model, it is also superior in annotation conversion between heterogeneous data.

(3) Fast neural coupled sequence labeling model via label pruning

The coupled sequence labeling model takes the Cartesian product of the tag sets of the two datasets, which yields a very large bundled tag space. For example, for joint WS and POS tagging, the number of bundled tags exceeds 10,000. This makes the model slow and memory-hungry. Li et al. (2016) proposed a context-aware local pruning strategy for the discrete-feature coupled sequence labeling model, which greatly improves its efficiency. Neural models, however, rely on large-matrix parallel operations for efficiency, and context-aware local pruning produces a different candidate set at each position, so large-matrix operations cannot be used. In view of this, we propose a strategy that prunes the bundled tag set directly for the neural coupled sequence labeling model. First, the trained model is used to predict noisy coupled tags on the multiple training sets; then, low-frequency coupled tags are pruned according to their frequency; finally, a fast neural coupled sequence labeling model is built in the pruned bundled tag space. Experimental results show that this method markedly improves the efficiency of the model without affecting tagging or conversion accuracy.

In summary, this thesis proposes a neural coupled sequence labeling model that effectively utilizes multi-source heterogeneous data and improves the performance of Chinese lexical analysis. We achieve some preliminary results on the POS tagging task, the joint WS and POS tagging task, and the POS conversion task between heterogeneous annotations. We hope that these results can further promote research on and development of higher-level tasks in NLP.
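
The bundling, score-summing, and frequency-pruning steps described in the abstract can be illustrated with a small pure-Python sketch. It is not the thesis implementation: the function names, the toy scores, and the `min_count` threshold are assumptions made for illustration, and plain dictionaries stand in for the outputs of the three MLPs. The "@" separator follows the "NN@n" example in the abstract. The `ambiguous_gold` helper reflects one common reading of ambiguous labeling (a one-sided gold tag is compatible with every bundled tag that contains it), which is also an assumption here.

```python
from collections import Counter
from itertools import product


def bundle(tags_a, tags_b):
    """Cartesian product of two tag sets, e.g. 'NN' and 'n' -> 'NN@n'."""
    return [f"{a}@{b}" for a, b in product(tags_a, tags_b)]


def coupled_scores(score_a, score_b, score_joint):
    """Final score of each bundled tag = separate score A + separate score B
    + joint score. Only bundled tags present in score_joint are scored, so
    passing a pruned joint space automatically shrinks the output."""
    combined = {}
    for tag, joint in score_joint.items():
        a, b = tag.split("@", 1)  # assumes '@' never occurs inside a tag
        combined[tag] = score_a[a] + score_b[b] + joint
    return combined


def ambiguous_gold(tag, side, space):
    """Assumed reading of ambiguous labeling: a gold tag from only one
    guideline is compatible with every bundled tag that contains it."""
    idx = 0 if side == "a" else 1
    return [t for t in space if t.split("@", 1)[idx] == tag]


def prune(predicted_tags, min_count):
    """Keep bundled tags predicted at least min_count times
    (min_count is a hypothetical threshold, not a value from the thesis)."""
    counts = Counter(predicted_tags)
    return sorted(t for t, c in counts.items() if c >= min_count)


if __name__ == "__main__":
    space = bundle(["NN", "VV"], ["n", "v"])
    scores = coupled_scores(
        {"NN": 2.0, "VV": 0.5},           # toy scores for tag set A
        {"n": 1.5, "v": 0.25},            # toy scores for tag set B
        {"NN@n": 1.0, "NN@v": 0.0, "VV@n": 0.0, "VV@v": 0.5},
    )
    print(max(scores, key=scores.get))    # best bundled tag for this token
    print(ambiguous_gold("NN", "a", space))
    print(prune(["NN@n", "NN@n", "VV@v", "NN@v", "VV@v"], min_count=2))
```

Because pruning fixes one global bundled tag set rather than a per-position candidate list, the scoring step keeps the same shape at every position, which is what allows a neural model to keep using large-matrix operations.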
Keywords/Search Tags:POS tagging, WS, Deep Learning, Heterogeneous data, Annotation Conversion, Coupled Sequence Labeling, BiLSTM, Annotation System, Tag Pruning