
Cross-Domain and Cross-Style Chinese Named Entity Recognition

Posted on: 2016-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Li
Full Text: PDF
GTID: 2298330467492894
Subject: Computer Science and Technology
Abstract/Summary:
Chinese Named Entity Recognition (NER) has received much attention as a basic and vital task in the field of Chinese Natural Language Processing (NLP). One of the most intractable problems for current statistical models is heterogeneity, namely the statistical discrepancy between the training and testing corpora, which degrades model performance.

This thesis builds a Chinese NER system with a Conditional Random Field (CRF) as the basic model. Two types of unsupervised features, labels produced by a Hidden Markov Model (HMM) and word vectors trained by a deep neural network, are then added to retrain the model respectively. The focus of this thesis is how the retrained model's performance changes when the data used for unsupervised training and the data used for testing are drawn from cross-domain or cross-style sources. The thesis therefore comprises three parts.

1. A Chinese NER system based on the CRF model is built. The system satisfies application requirements when tested on a corpus drawn from the same domain as the training data, and serves as the baseline for the following research.

2. A semi-supervised NER system is built by retraining the basic CRF model with additional HMM features. A large amount of unlabeled data is used to train an HMM, which annotates the data with labels that serve as new features. The focus is the performance of the semi-supervised system when the unlabeled data for training the HMM and the labeled data for training the CRF are drawn from different domains or different styles. Experiments show that HMM features improve the cross-domain adaptability of the system when the data come from the same style, but bring no improvement for cross-domain data of a different style. This conclusion enriches current research on domain adaptation.

3. Distributed word representations are introduced into the basic CRF model to build a new semi-supervised system that exploits word representations trained on large-scale unlabeled data by a deep neural network. As a comparison with the HMM-featured system above, the new system is also tested on datasets that diverge in domain and in style. Experiments show that features formed by direct concatenation of the distributed representations contribute little to the domain or style adaptability of the model, whereas features based on the cosine similarity of these representations outperform HMM features in promoting the model's style adaptability.
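The second part above describes joining unsupervised HMM labels to the per-token features of a CRF. A minimal sketch of that feature-augmentation step is below; the function name, feature keys, and label strings are illustrative assumptions, not taken from the thesis.

```python
# Sketch: build one feature dict per token for a CRF tagger, with the
# state label assigned by an unsupervised HMM joined as an extra feature.

def crf_features(tokens, hmm_labels):
    """tokens: characters of one sentence; hmm_labels: the HMM state
    decoded for each token from a model trained on unlabeled text."""
    feats = []
    for i, tok in enumerate(tokens):
        feats.append({
            "char": tok,
            "prev": tokens[i - 1] if i > 0 else "<BOS>",
            "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
            # unsupervised HMM state id as an additional feature
            "hmm": hmm_labels[i],
        })
    return feats
```

For example, `crf_features(["张", "三"], ["S7", "S2"])` yields two feature dicts whose `"hmm"` entries carry the HMM states alongside the usual character-window features.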
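The third part contrasts direct concatenation of word vectors with features derived from their cosine similarity. A hedged sketch of one way such a similarity feature could be discretized for a CRF is shown here; the prototype-vector scheme and the bin thresholds are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_feature(vec, prototype_vecs):
    """Map the max cosine similarity between a token's vector and a set
    of entity 'prototype' vectors into a coarse discrete CRF feature."""
    s = max(cosine(vec, p) for p in prototype_vecs)
    if s > 0.8:
        return "sim=high"
    if s > 0.5:
        return "sim=mid"
    return "sim=low"
```

Discretizing the real-valued similarity into a small set of feature strings keeps it compatible with the indicator-feature form that linear-chain CRFs typically consume.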
Keywords/Search Tags:named entity recognition, style adaptability, domain adaptability, distributed word representation, conditional random field, hidden markov model