Neural Networks Incorporating Multiple Target Domain Information For Cross-domain Chinese Word Segmentation

Posted on:2021-03-29

Degree:Master

Type:Thesis

Country:China

Candidate:T Huang

GTID:2428330605482461

Subject:Computer Science and Technology

Abstract/Summary:

Chinese word segmentation (CWS) is a fundamental step in many Chinese natural language processing (NLP) tasks, which gives it a pivotal position in Chinese NLP. Decades of research on CWS algorithms have shown that models based on character tagging outperform those based on traditional string matching and statistical methods in terms of accuracy and recall. In recent years, deep neural networks have received a great deal of attention in various fields of NLP, especially CWS. Compared with character-tagging CWS models built on traditional machine learning, deep neural CWS models, which are also based on character tagging, do not require complicated feature engineering and achieve even higher segmentation accuracy.

However, deep neural CWS models still suffer from the cross-domain problem: segmentation accuracy drops dramatically when the training set and the test set belong to different domains. In what follows, the domains of the training and test sets are referred to as the source domain and the target domain, respectively. Many methods that incorporate additional target domain information into neural CWS models have been proposed to solve the cross-domain problem. However, most of them focus solely on improving segmentation accuracy in the target domain, without considering universality. In addition, partially-labeled data is almost always used in a single way, namely by changing the loss function of the supervised CWS model; other ways of using partially-labeled data in CWS remain to be studied. The main work of this thesis is as follows.

(1) A novel neural CWS model that incorporates both common lexicons and unlabeled data from the target domain into BERT is proposed to address the cross-domain problem from the perspectives of segmentation accuracy and universality. For universality, BERT is used as the baseline CWS model, and a lexicon-based feature vector is designed that reflects the position of a single character within the related word. For segmentation accuracy, a language model is adopted to learn target domain information from unlabeled data. Finally, a gate structure integrates the lexicon-based feature vector and the output vector of the language model into BERT. Experiments show that the model achieves high F1 values on Zhu Xian, self-made datasets, and SIGHAN2010, demonstrating strong domain adaptability and an effective solution to the cross-domain problem.

(2) A self-training neural CWS model that uses partially-labeled data is proposed. First, a new scheme is designed to obtain artificial partially-labeled data from lexicons and unlabeled data. Then, the partially-labeled data is used to train a BiLSTM CWS model by modifying the loss function. Finally, self-training iteratively adds the partially-labeled data that satisfies the segmentation accuracy confidence and difference confidence criteria to the labeled data, and the CWS model is continuously optimized, so that the final model achieves excellent segmentation performance in the target domain. Experiments on SIGHAN2005 and SIGHAN2010 show that the proposed model effectively improves segmentation accuracy in the target domain.
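
To make the idea of a lexicon-based feature vector in (1) more concrete, here is a minimal Python sketch of one common construction. The abstract does not give the exact encoding, so everything here is an assumption for illustration: the function name `lexicon_position_features`, the `max_word_len` parameter, and the 4-dimensional binary layout (begin, middle, end, single) are hypothetical and may differ from the thesis. For each character, the vector records which positions that character occupies in any lexicon word found in the sentence.

```python
from typing import List, Set


def lexicon_position_features(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[List[int]]:
    """Return one 4-dim binary vector [B, M, E, S] per character.

    Assumed encoding (not taken from the thesis): a bit is set when the
    character is the Beginning, Middle, or End of a multi-character lexicon
    word found in the sentence, or is itself a Single-character lexicon word.
    """
    n = len(sentence)
    feats = [[0, 0, 0, 0] for _ in range(n)]
    for start in range(n):
        for length in range(1, max_word_len + 1):
            end = start + length
            if end > n:
                break
            if sentence[start:end] not in lexicon:
                continue
            if length == 1:
                feats[start][3] = 1          # S: single-character word
            else:
                feats[start][0] = 1          # B: first character of the word
                feats[end - 1][2] = 1        # E: last character of the word
                for mid in range(start + 1, end - 1):
                    feats[mid][1] = 1        # M: interior character
    return feats


if __name__ == "__main__":
    demo_lexicon = {"我", "北京", "天安门"}   # toy lexicon, for illustration only
    for ch, vec in zip("我爱北京天安门", lexicon_position_features("我爱北京天安门", demo_lexicon)):
        print(ch, vec)
```

In a model such as the one described, per-character vectors of this kind would be combined with the contextual character representations (for example through a gate); that combination step is not shown here.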

Keywords/Search Tags:

Chinese word segmentation, cross-domain, neural network, lexicon, unlabeled data, partially-labeled data

Related items

1	Research On Domain Adaptation Of Chinese Word Segmentation With Multi-source Features And Data
2	Research On Boosting Chinese Word Segmentation Accuracy With Partially Annotated Data
3	Research And Implementation For Chinese Lexicon Analysis System Based On Neural Network
4	Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery
5	Neural Domain Adaptive Chinese Word Segmentation Algorithm
6	Research On Chinese Word Segmentation For Domain Literature
7	Research On Neural Network Based Methods Of Chinese Word Segmentation For Domain Adaptation
8	Research On Word Segmentation Based On Probabilistic Model Of Dynamic Lexicon
9	The Design And Implementation Of A Fast Chinese Word Segmentation System Based On Field Text Big Data
10	The Campus Network Core Search Engine Technology - Chinese Word Segmentation