Research And Implementation Of Domain Adaptive Chinese Word Segmentation System

Posted on: 2018-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Zhang
GTID: 2348330512473281
Subject: Computer technology
Abstract/Summary:
Chinese word segmentation is the process of dividing a continuous character sequence into a sequence of words according to a given segmentation standard. As one of the most basic steps in natural language processing, it is a key link in information retrieval, knowledge acquisition, and machine translation, so its study has both theoretical and practical significance.

This thesis proposes a multi-model method for Chinese word segmentation. The method builds a separate neural network model for each character. Chinese characters carry their own semantic information, and different characters have different meanings and functions in different contexts, so word-formation rules differ from character to character. Unlike existing character-based tagging methods, this method can distinguish the effect of each feature on different characters and learn the word-formation rules specific to each character. Compared with a single model, a CRF method, and other related work, the proposed multi-model method achieves better segmentation results. On the PKU and MSR Chinese corpora provided by the SIGHAN Bakeoff 2005, the F-scores are 93.4% and 95.5% respectively.

Building on this method, the thesis proposes a multi-model adaptive segmentation method for the domain adaptation task. Because the character models are independent of one another, a model update can retain the character models that transfer well and update only those that transfer poorly. This addresses two problems: large-scale segmentation data is difficult to share, and mixing data from the source domain and the target domain requires retraining. When segmenting text from the target domain, domain adaptation is achieved through the adaptive ability of the model itself. Since embedding representations can effectively alleviate the sparsity problem, the thesis uses embedding features to represent the input. Experimental results show that the proposed method effectively improves the domain adaptability of Chinese word segmentation.

Finally, a domain adaptive Chinese word segmentation system is designed and implemented. The system segments an input sentence or text with a basic model and supports adding dictionaries for the relevant domain. It can also update the basic model with domain training data and produce segmentation results for the relevant domain.
Keywords: Chinese word segmentation, multi-model, character-based tagging, domain field, feature embedding
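
For readers unfamiliar with the setup, the following is a minimal Python sketch of character-based tagging and of the word-level F-score used to report segmentation quality. It assumes the common four-tag B/M/E/S scheme (begin, middle, end, single); the abstract does not state which tag set the thesis uses. The function names and the example sentence are illustrative only and are not taken from the thesis or its system.

```python
from typing import List, Set, Tuple


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends on B or M
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) character offsets of each word."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(gold: List[str], predicted: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative example (not from the thesis data).
gold = ["我", "喜欢", "自然", "语言", "处理"]
predicted = ["我", "喜欢", "自然语言", "处理"]
gold_tags = words_to_tags(gold)
score = f_score(gold, predicted)
```

In this example three of the four predicted words match gold words exactly, so precision is 3/4, recall is 3/5, and the F-score is 2/3. A tagger in this setting predicts one tag per character, and `tags_to_words` turns the tag sequence back into a segmentation that can be scored.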