Research On Heterogeneous Data Exploitation For Sequence Labeling

Posted on:2018-05-24

Degree:Master

Type:Thesis

Country:China

Candidate:J Y Chao

Full Text:PDF

GTID:2348330542465283

Subject:Computer Science and Technology

Abstract/Summary:

Most supervised statistical machine learning methods train their model parameters on a single manually annotated corpus. However, a single manually labeled dataset usually suffers from limited scale and genre coverage, and manually annotating new data is extremely time- and labor-intensive. This thesis tries to utilize multiple labeled datasets that follow different annotation guidelines (i.e., multiple heterogeneous resources) to improve the performance of statistical models, using Chinese POS tagging as a case study. The main research contents of this thesis are as follows:

(1) Conversion of Multiple Resources for POS Tagging. We propose an annotation conversion method using multiple resources for POS tagging, aiming to convert the source-side annotations into target-side annotations and then combine the data to obtain a larger training set. This thesis proposes two novel strategies. The first strategy uses reliability information of guide features. The second strategy uses ambiguous labelings to improve the quality of the converted data. Results demonstrate that the first strategy improves tagging accuracy by a small margin, while the second has little impact.

(2) Coupled Sequence Labeling on Heterogeneous Annotations. In order to effectively utilize multiple datasets with heterogeneous annotations, we propose a coupled sequence labeling model that can directly learn and infer two heterogeneous annotations simultaneously. The basic idea is to bundle two sets of POS tags together (e.g., "[NN, n]") and form the bundled tag space with a POS tag mapping function. We design and experiment with four different mapping functions, and train the coupled CRF model on two non-overlapping datasets, each of which has only one-side tags, in the form of ambiguous labelings. Experiments show that our coupled model significantly improves the performance of both POS tagging and annotation conversion.

(3) Fast Coupled Sequence Labeling on Heterogeneous Annotations via Context-aware Pruning. Our study shows that the coupled model based on a mapping function can effectively utilize multiple heterogeneous datasets, but suffers from severe inefficiency due to the large bundled tag space. We propose a context-aware online pruning strategy that captures mapping relationships between annotations more accurately, based on contextual evidence. Experiments show that our approach effectively solves the inefficiency problem of the coupled model under the complete mapping function, so that the coupled model is comparable in efficiency to the baseline non-coupled model, without sacrificing accuracy.

In conclusion, this thesis tries to exploit existing resources with different annotation standards to improve the accuracy of Chinese POS tagging. We have made some preliminary progress so far, which we hope can further motivate progress in natural language processing and in high-level applications such as machine translation and information retrieval.
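The bundled tag space and the ambiguous labelings mentioned in part (2) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the tag sets are toy examples, and the function names `complete_mapping`, `restricted_mapping`, and `ambiguous_labeling` are hypothetical names introduced here. The abstract names only the complete mapping function, so the restricted variant is shown purely as an assumed contrast.

```python
from itertools import product

# Toy tag sets for illustration only (not the real tag inventories).
SIDE_A_TAGS = ["NN", "VV", "AD"]
SIDE_B_TAGS = ["n", "v", "d"]


def complete_mapping(tags_a, tags_b):
    """Complete mapping: every side-A tag may bundle with every side-B tag."""
    return set(product(tags_a, tags_b))


def restricted_mapping(allowed_pairs):
    """Hypothetical tighter mapping: only the listed bundles are allowed."""
    return set(allowed_pairs)


def ambiguous_labeling(gold_tags, side, bundled_space):
    """Expand one-side gold tags into sets of compatible bundled tags.

    `side` is 0 when the gold tags come from side A and 1 when they come
    from side B. Each token receives every bundle whose `side` component
    equals its gold tag, so the other side stays undetermined.
    """
    return [
        {bundle for bundle in bundled_space if bundle[side] == gold}
        for gold in gold_tags
    ]


if __name__ == "__main__":
    space = complete_mapping(SIDE_A_TAGS, SIDE_B_TAGS)
    print(len(space))  # 3 * 3 = 9 bundled tags
    # A sentence annotated only with side-A tags:
    for candidates in ambiguous_labeling(["NN", "VV"], 0, space):
        print(sorted(candidates))
```

Under the complete mapping the bundled space grows as the product of the two tag-set sizes, which is the source of the inefficiency that part (3) addresses with pruning.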

Keywords/Search Tags:

Part-of-Speech Tagging, Heterogeneous Data, Annotation Conversion, Coupled Sequence Labeling, Conditional Random Field

Related items

1	Conversion And Exploitation Of Heterogeneous Annotations Based On Neural Coupled Sequence Labeling
2	Heterogeneous Data In Chinese POS Tagging
3	Research On Parallel Corpora-based Unsupervised Part-of-speech Tagging For Chinese
4	Research On The Learning Of Integrating Chinese Word Segmentation With Part-of-Speech Tagging And Domain Adaption Approach
5	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
6	Research On Data Annotation Of Bank Transaction Short Message Information Based On CRF Model
7	Complextext Sequence Labeling With BILSTM And CRF Algorithm Based On Peephole
8	Research Of Sequence Labeling Technics Based On Graph Models
9	Research On Chinese Lexical Analysis Model Algorithm Based On Deep Learning
10	Research On Object Extraction Of Automobile Product Based On Sequence Labeling