Neural Domain Adaptive Chinese Word Segmentation Algorithm

Posted on: 2019-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Bao
Full Text: PDF
GTID: 2348330542998690
Subject: Information and Communication Engineering
Abstract/Summary:
Unlike English, Chinese text is written continuously, without word delimiters such as spaces, so a computer must segment it into words before further processing. Chinese word segmentation is therefore one of the basic tasks of Chinese natural language processing, and the performance of a segmentation system strongly affects the downstream tasks built on it. Over the past decades, many large annotated datasets for Chinese word segmentation have been established and segmentation algorithms have improved continuously. From traditional feature-based models to neural networks, segmentation systems have reached F1 scores above 0.95. However, because the manually annotated data consist mainly of newswire text, models trained on these corpora suffer performance degradation in other domains. This problem is known as domain adaptation. This thesis studies neural Chinese word segmentation and its domain adaptation. The main contributions are as follows:

(1) For neural Chinese word segmentation, we propose a model that combines convolutional and recurrent neural networks. A convolutional network with multiple convolution kernels extracts hidden multi-scale features from the sentence. The convolutional network is combined with the recurrent network, and k-max pooling is added to reduce the complexity of the whole model. Experiments on three public datasets show that the combined network outperforms previous work.

(2) For semi-supervised domain adaptation of Chinese word segmentation, we explore the differences between Chinese corpora from different domains and propose three semi-supervised domain adaptation strategies based on a character language model. Specifically, after counting unigrams and bigrams in corpora from different domains, we find that the differences between corpora are mainly reflected in how characters combine. We therefore use a character-level language model to capture this relationship and propose three specific domain adaptation strategies. In experiments on public datasets, we compare our methods with previous semi-supervised domain adaptation methods; our method achieves performance comparable to the previous dictionary-based method while using only unlabeled target-domain data.

(3) For fully supervised domain adaptation of Chinese word segmentation, in contrast to traditional regularization methods, we propose a dynamic regularization strategy based on neural networks. Specifically, the source-domain segmentation model constrains the training of the target-domain model. This regularization constraint controls the training of the target-domain model according to the probability distribution that the source-domain model assigns to each training sample. In experiments on public datasets, our method outperforms previous fully supervised methods for Chinese word segmentation, and it matches the performance of previous models while using less annotated data.
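
The unigram and bigram comparison mentioned in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's actual procedure or data: the function names `char_ngrams` and `unseen_ratio`, the "unseen token ratio" measure, and the two toy corpora are all assumptions made for illustration. The sketch only shows how a target domain can reuse the same characters as a source domain while combining them differently.

```python
from collections import Counter


def char_ngrams(sentences, n):
    """Count character n-grams over an iterable of unsegmented sentences."""
    counts = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[sentence[i:i + n]] += 1
    return counts


def unseen_ratio(source_counts, target_counts):
    """Fraction of target n-gram tokens whose type never occurs in the source."""
    total = sum(target_counts.values())
    if total == 0:
        return 0.0
    unseen = sum(c for gram, c in target_counts.items() if gram not in source_counts)
    return unseen / total


# Toy corpora (hypothetical, not from the thesis).
source = ["中国人民", "学习文学"]
target = ["国学", "人文"]

# Every target character also appears in the source corpus ...
uni_gap = unseen_ratio(char_ngrams(source, 1), char_ngrams(target, 1))
# ... but none of the target character pairs do.
bi_gap = unseen_ratio(char_ngrams(source, 2), char_ngrams(target, 2))

print(f"unseen unigram ratio: {uni_gap:.2f}")
print(f"unseen bigram ratio:  {bi_gap:.2f}")
```

On this toy data the unigram gap is zero while the bigram gap is total, which is the kind of pattern that would motivate modelling character combinations with a character-level language model rather than relying on character identity alone.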
Keywords/Search Tags: Chinese word segmentation, neural network, domain adaptation