
Neural Networks Incorporating Multiple Target Domain Information For Cross-domain Chinese Word Segmentation

Posted on: 2021-03-29
Degree: Master
Type: Thesis
Country: China
Candidate: T Huang
Full Text: PDF
GTID: 2428330605482461
Subject: Computer Science and Technology
Abstract/Summary:
Chinese word segmentation (CWS) is a fundamental step for many Chinese natural language processing (NLP) tasks, which gives it a pivotal position in Chinese NLP. Research on CWS algorithms has been going on for decades, and it shows that CWS models based on character tagging are superior to those based on traditional string matching and statistical methods in terms of accuracy and recall. In recent years, deep neural networks have received considerable attention in various fields of NLP, especially CWS. Compared with character tagging-based machine learning CWS models, deep neural CWS models, which are also based on character tagging, do not require complicated feature engineering and achieve even higher segmentation accuracy.

However, deep neural CWS models still suffer from the cross-domain problem: segmentation accuracy drops dramatically when the training set and the test set belong to different domains. In the following, the domains of the training and test sets are referred to as the source and target domains respectively. Many methods that incorporate additional target domain information into neural-based CWS models have been proposed to address the cross-domain problem. However, most of these methods focus only on improving segmentation accuracy in the target domain, without considering universality. In addition, partially-labeled data is almost always used in a single way, namely by changing the loss function of the supervised CWS model. Other ways of using partially-labeled data in CWS remain to be studied. The main research of this thesis is as follows.

(1) A novel neural-based CWS model, which incorporates both common lexicons and unlabeled data derived from the target domain into BERT, is proposed to address the cross-domain problem from the perspectives of segmentation accuracy and universality. For universality, BERT is used as the benchmark CWS model, and a lexicon-based feature vector is designed that reflects the position of a single character within the related lexicon word. For segmentation accuracy, a language model is adopted to learn target domain information from unlabeled data. Finally, a gate structure is used to integrate the lexicon-based feature vector and the output vector of the language model into BERT. Experiments show that this model achieves high F1 values on Zhu Xian, self-made datasets and SIGHAN2010, which indicates that the model has strong domain adaptability and handles the cross-domain problem well.

(2) A self-training neural-based CWS model using partially-labeled data is proposed. First, a new scheme is designed to obtain artificial partially-labeled data from lexicons and unlabeled data. Then, the partially-labeled data is used to train a BiLSTM CWS model by modifying the loss function. Finally, the self-training method iteratively adds to the labeled data those partially-labeled samples that satisfy the segmentation accuracy confidence and the difference confidence, and the CWS model is continuously re-optimized, so that the final model achieves excellent segmentation performance in the target domain. Experiments on SIGHAN2005 and SIGHAN2010 show that the proposed model effectively improves segmentation accuracy in the target domain.
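To make the lexicon-based feature vector from contribution (1) more concrete, the following is a minimal sketch of one common way such a feature could be built. The abstract does not specify the actual design, so everything here is an assumption for illustration: the 4-dimensional multi-hot layout over the positions B (begin), M (middle), E (end) and S (single), the function name `lexicon_features`, and the `max_len` cap on candidate word length are all hypothetical.

```python
from typing import List, Set

# Assumed position inventory: Begin, Middle, End, Single.
TAGS = ("B", "M", "E", "S")


def lexicon_features(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[List[int]]:
    """Return one 4-dim multi-hot vector per character.

    A bit is set when the character occupies that position (B/M/E/S)
    in at least one lexicon word that matches the sentence at that span.
    This is an illustrative scheme, not the thesis's confirmed design.
    """
    n = len(sentence)
    feats = [[0, 0, 0, 0] for _ in range(n)]
    for start in range(n):
        for length in range(1, max_len + 1):
            end = start + length
            if end > n:
                break
            if sentence[start:end] not in lexicon:
                continue
            if length == 1:
                feats[start][3] = 1          # S: single-character word
            else:
                feats[start][0] = 1          # B: first character of the match
                for mid in range(start + 1, end - 1):
                    feats[mid][1] = 1        # M: interior characters
                feats[end - 1][2] = 1        # E: last character of the match
    return feats


if __name__ == "__main__":
    # Hypothetical target-domain lexicon and sentence.
    demo_lexicon = {"南京", "南京市", "长江", "大桥", "长江大桥"}
    for ch, vec in zip("南京市长江大桥", lexicon_features("南京市长江大桥", demo_lexicon)):
        print(ch, dict(zip(TAGS, vec)))
```

Because overlapping matches can set several bits for the same character, the vector records ambiguity rather than resolving it; in a model like the one described, a downstream component such as the gate structure would decide how much weight this signal receives.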
Keywords/Search Tags: Chinese word segmentation, cross-domain, neural network, lexicon, unlabeled data, partially-labeled data