
Research On Domain Adaptation Of Chinese Word Segmentation With Multi-source Features And Data

Posted on: 2020-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhu
Full Text: PDF
GTID: 2428330578977882
Subject: Computer technology major
Abstract/Summary:
With the development of deep learning, neural Chinese word segmentation (WS) models have achieved very high performance on closed-domain text. However, when the application scenario switches from a closed domain to an open domain, their WS performance drops significantly, mainly because the models generalize poorly and the training data is limited in scale. To address these two problems, this thesis tries to improve the base model's generalization ability and to expand the scale of the task data. The main contributions are as follows:

(1) Improving WS Performance with Additional Features

Most Chinese characters have their own distinctive graphical representation and pronunciation. The Wubi input code describes the glyph information of a Chinese character, and many similar words have similar graphical representations. Polyphonic characters are pronounced differently in different contexts, so Pinyin can also carry semantic information about these words. In addition, in the domain adaptation scenario, some domain-specific terms follow similar word-formation rules. We therefore try to improve the WS performance of the base model with these three additional features.

(2) Improving WS Performance with Extra Partially Annotated Data

In the domain adaptation scenario, a single manually labeled dataset usually suffers from limited scale and genre coverage, and manually annotating new data is extremely time- and labor-intensive. We therefore set out to construct a partially annotated corpus. On the one hand, we develop a systematic annotation workflow, which includes obtaining the data to annotate, building the annotation system, and formulating the WS criterion, so as to obtain high-quality partially annotated data. On the other hand, we propose a data filtering method that filters partially annotated web data using unlabeled target-domain data. Both methods significantly improve WS performance in the domain adaptation scenario. However, because partially annotated data contains less WS information, a large amount of it must be mixed into the training data, which costs a lot of time.

(3) Improving WS Performance with Extra Heterogeneous Data

Considering that partially annotated data contains less WS information, we exploit extra heterogeneous data to improve WS performance. Existing mainstream methods for using multi-source heterogeneous data suffer from complicated models or error propagation. We propose a method based on a corpus feature that significantly improves WS performance without modifying the model or introducing noisy data. Finally, we compare it with the Multi-Task Learning (MTL) method and show that our method achieves performance similar to MTL while being simpler and more efficient.

In summary, this thesis significantly improves WS performance, and we sincerely hope that the proposed methods can help improve performance on other natural language processing tasks.
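
The abstract does not say how the corpus feature in contribution (3) is encoded. As a rough illustration only, the sketch below assumes it is a per-character tag naming the source corpus, attached alongside the usual B/M/E/S segmentation labels, so that data annotated under different segmentation standards can be mixed in one training set. The function names `to_bmes` and `add_corpus_feature` and the corpus identifiers are hypothetical and are not taken from the thesis.

```python
from typing import List, Tuple


def to_bmes(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # Begin, zero or more Middle, End
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def add_corpus_feature(
    words: List[str], corpus_id: str
) -> List[Tuple[str, str, str]]:
    """Attach a corpus identifier (assumed feature) to every character.

    Returns (character, corpus_id, tag) triples; the corpus_id would be fed
    to the model as an extra input feature next to the character itself.
    """
    chars = [ch for word in words for ch in word]
    return [(ch, corpus_id, tag) for ch, tag in zip(chars, to_bmes(words))]


# The same string segmented under two hypothetical annotation standards.
fine = add_corpus_feature(["中国", "人民"], "corpus_A")
coarse = add_corpus_feature(["中国人民"], "corpus_B")
```

Under this assumption, the two corpora can disagree on the gold tags for the same characters without the examples contradicting each other, because each example carries the identifier of the standard it follows.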
Keywords/Search Tags: Chinese Word Segmentation, Domain Adaptation, Glyph and Pinyin Features, Partially Annotated Data