
Cross-Domain and Cross-Style Chinese Named Entity Recognition

Posted on: 2016-03-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Li
Full Text: PDF
GTID: 2298330467492894
Subject: Computer Science and Technology
Abstract/Summary:
Chinese Named Entity Recognition (NER) has received much attention as a basic and vital task in the field of Chinese Natural Language Processing (NLP). One of the most intractable problems for current statistical models is heterogeneity, namely the statistical discrepancy between the training and testing corpora, which degrades model performance.

This thesis builds a Chinese NER system with a Conditional Random Field (CRF) as the basic model. Two types of unsupervised features, labels produced by a Hidden Markov Model (HMM) and word vectors trained by a deep neural network, are then added to retrain the model respectively. The focus of this thesis is how the retrained model's performance changes when the data used for unsupervised training and the data used for testing are drawn from cross-domain or cross-style sources. The thesis therefore comprises three parts.

1. A Chinese NER system based on the CRF model is built. The system satisfies application requirements when tested on a corpus drawn from the same domain as the training data, and serves as the baseline for the following research.

2. A semi-supervised NER system is built by retraining the basic CRF model with additional HMM features. A large amount of unlabeled data is used to train an HMM, which annotates the data with labels that serve as new features. The focus is the performance of the semi-supervised system when the unlabeled data for training the HMM and the labeled data for training the CRF are drawn from different domains or different styles. Experiments show that HMM features improve the cross-domain adaptability of the system when the data come from the same style, but bring no improvement for cross-domain data of a different style. This conclusion enriches current research on domain adaptation.

3. Distributed word representations are introduced into the basic CRF model to build a new semi-supervised system that exploits word representations trained on large-scale unlabeled data by a deep neural network. As a comparison with the HMM-featured system above, the new system is also tested on datasets that diverge in domain and in style. Experiments show that features formed by direct concatenation of the distributed representations contribute little to the domain or style adaptability of the model, whereas features based on the cosine similarity of these representations outperform HMM features in promoting the model's style adaptability.
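The second part above describes joining unsupervised HMM labels to the per-token features of a CRF. A minimal sketch of that feature-augmentation step is below; the function name, feature keys, and label strings are illustrative assumptions, not taken from the thesis.

```python
# Sketch: build one feature dict per token for a CRF tagger, with the
# state label assigned by an unsupervised HMM joined as an extra feature.

def crf_features(tokens, hmm_labels):
    """tokens: characters of one sentence; hmm_labels: the HMM state
    decoded for each token from a model trained on unlabeled text."""
    feats = []
    for i, tok in enumerate(tokens):
        feats.append({
            "char": tok,
            "prev": tokens[i - 1] if i > 0 else "<BOS>",
            "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
            # unsupervised HMM state id as an additional feature
            "hmm": hmm_labels[i],
        })
    return feats
```

For example, `crf_features(["张", "三"], ["S7", "S2"])` yields two feature dicts whose `"hmm"` entries carry the HMM states alongside the usual character-window features.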
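The third part contrasts direct concatenation of word vectors with features derived from their cosine similarity. A hedged sketch of one way such a similarity feature could be discretized for a CRF is shown here; the prototype-vector scheme and the bin thresholds are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_feature(vec, prototype_vecs):
    """Map the max cosine similarity between a token's vector and a set
    of entity 'prototype' vectors into a coarse discrete CRF feature."""
    s = max(cosine(vec, p) for p in prototype_vecs)
    if s > 0.8:
        return "sim=high"
    if s > 0.5:
        return "sim=mid"
    return "sim=low"
```

Discretizing the real-valued similarity into a small set of feature strings keeps it compatible with the indicator-feature form that linear-chain CRFs typically consume.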
Keywords/Search Tags:named entity recognition, style adaptability, domain adaptability, distributed word representation, conditional random field, hidden markov model