Research On Heterogeneous Data Exploitation For Sequence Labeling

Posted on: 2018-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Chao
Full Text: PDF
GTID: 2348330542465283
Subject: Computer Science and Technology
Abstract/Summary:
Most supervised statistical machine learning methods train their model parameters on a single manually annotated corpus. However, a single manually labeled dataset usually suffers from limited scale and genre coverage, and manually annotating new data is extremely time-consuming and labor-intensive. This thesis tries to utilize multiple labeled datasets that follow different annotation guidelines (i.e., multiple heterogeneous resources) to improve the performance of statistical models, using Chinese POS tagging as a case study. The main research contents of this thesis are as follows:

(1) Conversion of Multiple Resources for POS Tagging
We propose an annotation conversion method that uses multiple resources for POS tagging. The aim is to convert the source-side annotations into the target-side standard and then combine the data to obtain a larger training set. This thesis proposes two novel strategies. The first uses reliability information of guide features. The second uses ambiguous labelings to improve the quality of the converted data. Results show that the first strategy improves tagging accuracy by a small margin, while the second has little impact.

(2) Coupled Sequence Labeling on Heterogeneous Annotations
To effectively utilize multiple datasets with heterogeneous annotations, we propose a coupled sequence labeling model that can directly learn and infer two heterogeneous annotations simultaneously. The basic idea is to bundle two sets of POS tags together (e.g., “[NN,n]”) and form the bundled tag space with a POS tag mapping function. We design and experiment with four different mapping functions, and train the coupled CRF model on two non-overlapping datasets, each of which has only one-side tags, in the form of ambiguous labelings. Experiments show that our coupled model significantly improves the performance of both POS tagging and annotation conversion.

(3) Fast Coupled Sequence Labeling on Heterogeneous Annotations via Context-aware Pruning
Our study shows that the coupled model based on a mapping function can effectively utilize multiple heterogeneous datasets, but suffers from severe inefficiency due to the large bundled tag space. We propose a context-aware online pruning strategy that can more accurately capture mapping relationships between annotations based on contextual evidence. Experiments show that our approach effectively solves the inefficiency problem of the coupled model under the complete mapping function, so that the coupled model is comparable to the baseline non-coupled model in terms of efficiency, without sacrificing accuracy.

In conclusion, this thesis tries to exploit existing resources with different annotation standards to improve the accuracy of Chinese POS tagging. We have made some preliminary progress so far, which we hope can further motivate progress in natural language processing and in high-level applications such as machine translation and information retrieval.
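The bundled tag space and the ambiguous labelings described in part (2) can be illustrated with a small sketch. This is not the thesis's implementation. The tag sets, the restricted mapping, and all function names below are hypothetical, chosen only to echo the “[NN,n]” example. It shows how a mapping function determines the bundled tag space, and how a sentence carrying only one-side tags expands into a set of candidate bundled tags per token.

```python
# Hypothetical toy tag sets for two annotation standards (assumed for illustration).
SIDE_A_TAGS = ["NN", "VV", "AD"]
SIDE_B_TAGS = ["n", "v", "d", "vn"]

# Hypothetical restricted mapping: each side-A tag may pair with a few side-B tags.
RESTRICTED_MAPPING = {"NN": {"n", "vn"}, "VV": {"v", "vn"}, "AD": {"d"}}


def complete_mapping(tags_a, tags_b):
    """Complete mapping: every side-A tag may pair with every side-B tag."""
    return {a: set(tags_b) for a in tags_a}


def bundled_tag_space(mapping):
    """Enumerate the bundled tags (a, b) that the mapping function allows."""
    return sorted((a, b) for a, bs in mapping.items() for b in bs)


def ambiguous_labeling(one_side_tags, side, mapping):
    """Expand one-side gold tags into candidate bundled tags for each token.

    side is "a" or "b", naming which annotation standard the sentence follows.
    """
    space = bundled_tag_space(mapping)
    idx = 0 if side == "a" else 1
    return [[bt for bt in space if bt[idx] == tag] for tag in one_side_tags]


if __name__ == "__main__":
    full = bundled_tag_space(complete_mapping(SIDE_A_TAGS, SIDE_B_TAGS))
    small = bundled_tag_space(RESTRICTED_MAPPING)
    print(len(full), len(small))
    # A sentence annotated only with side-A tags:
    print(ambiguous_labeling(["NN", "VV"], "a", RESTRICTED_MAPPING))
```

In this toy setting the complete mapping yields a bundled tag space whose size is the product of the two tag-set sizes, which gives a feel for why part (3) describes the coupled model under the complete mapping as inefficient and motivates pruning.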
Keywords/Search Tags: Part-of-Speech Tagging, Heterogeneous Data, Annotation Conversion, Coupled Sequence Labeling, Conditional Random Field