
Research On Domain Adaptation Method For Chinese Segmentation Based On Instance Transfer Learning

Posted on: 2020-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Zhang
GTID: 2428330578957242
Subject: Computer Science and Technology
Abstract/Summary:
Chinese word segmentation (CWS) is the process of dividing a sequence of Chinese characters into individual words according to a given standard. It is a basic component of Chinese natural language processing (NLP) and a prerequisite for other NLP tasks such as information retrieval, knowledge mapping and machine translation. In recent years, with the development of deep learning, neural network methods have been widely used in NLP. Unlike traditional rule-based and statistical word segmentation methods, neural network methods train on large-scale annotated data to obtain a model with strong generalization ability. However, CWS is strongly domain dependent: when a segmentation model trained in one domain is applied to another domain, its performance degrades. Current CWS datasets are mostly drawn from the news domain, so how to use data from resource-rich domains to improve segmentation performance in resource-poor domains has become increasingly important.

This paper studies domain adaptation for CWS. Current research on domain adaptation for CWS faces two main challenges. On the one hand, the same words may have different contexts and meanings in different domains, which leads to segmentation ambiguity. On the other hand, different domains contain different words, so the model cannot effectively identify out-of-vocabulary (OOV) words. In response to these challenges, this paper proposes a domain adaptation method for CWS based on instance transfer learning. The main idea is to select a small number of valuable instances for labeling and then use those annotated instances to help train the word segmentation model, thereby improving the domain adaptation ability of the CWS model.

The main contributions of this paper are as follows:
(1) For the general neural Chinese word segmentation system (BiLSTM-CRF), we design two improved schemes, one that adds an attention mechanism and one that integrates the BERT language model, namely the Att-BiLSTM-CRF framework and the Bert-BiLSTM-CRF framework. The attention mechanism adds historical information, and the BERT language model incorporates more semantic features.
(2) For the domain adaptation problem of CWS, by analyzing the characteristics of the source data and the target data, this paper defines a similarity calculation method based on n-gram character vectors. With this method, we can select target instances that are similar in structure to the source data and contain a large number of OOV words.
(3) For the domain adaptation problem of CWS, this paper proposes an adaptive CWS method based on instance transfer learning. When transferring samples, a sampling strategy based on similarity and uncertainty is used to select instances. Furthermore, we correct the model's annotations of the selected instances to avoid negative transfer.

This research is a further attempt to use instance transfer learning to address domain adaptation problems. Experimental results show that the proposed method effectively enhances the domain adaptation ability of the model and improves the accuracy of Chinese word segmentation.
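The abstract does not give formulas for the n-gram similarity or for the sampling strategy, so the following is only a minimal sketch of how such a selection step might look. It assumes cosine similarity over character n-gram count vectors and a linear mix of similarity and uncertainty. The names `char_ngrams`, `cosine`, `select_instances`, the weight `alpha`, and the `uncertainty` callable are all hypothetical and are not taken from the thesis. In practice the uncertainty would come from the segmentation model itself; here a simple unseen-character ratio stands in for it.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Return a Counter of the character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_instances(source_sents, target_sents, uncertainty, k=2, n=2, alpha=0.5):
    """Pick the k target sentences with the highest mixed score.

    score = alpha * similarity_to_source + (1 - alpha) * uncertainty(sentence)
    The linear mix and alpha are assumptions made for illustration only.
    """
    # Aggregate all source sentences into one n-gram profile.
    source_profile = Counter()
    for sent in source_sents:
        source_profile.update(char_ngrams(sent, n))

    scored = []
    for sent in target_sents:
        sim = cosine(char_ngrams(sent, n), source_profile)
        score = alpha * sim + (1 - alpha) * uncertainty(sent)
        scored.append((score, sent))

    # Highest score first; keep the top k sentences.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]


def make_unseen_char_ratio(source_sents):
    """Stand-in uncertainty: fraction of characters never seen in the source."""
    seen = set("".join(source_sents))

    def ratio(sent):
        if not sent:
            return 0.0
        return sum(ch not in seen for ch in sent) / len(sent)

    return ratio


if __name__ == "__main__":
    source = ["今天天气很好", "我们去公园散步"]
    target = ["今天股市行情很好", "患者出现心律失常", "我们去公园"]
    picked = select_instances(source, target, make_unseen_char_ratio(source), k=2)
    print(picked)
```

Under these assumptions, the selected sentences would then be annotated, corrected, and added to the training data, which is the step the abstract describes as guarding against negative transfer.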
Keywords/Search Tags: Chinese word segmentation, Domain adaptation, Instance transfer learning, Neural network, Natural language processing