Research On Chinese Word Segmentation For Domain Literature

Posted on: 2019-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: H H Sun
Full Text: PDF
GTID: 2428330578972714
Subject: Computer application technology
Abstract/Summary:
Chinese word segmentation is a fundamental technology in Chinese information processing. As applications of Chinese information processing develop, the volume of data produced in various fields keeps growing, and so does the demand for word segmentation in specialized domains. However, most current training corpora for word segmentation are drawn from general news text. In cross-domain segmentation tasks, results are often poor because the training corpus and the text to be segmented differ greatly in word-formation characteristics and distribution. Segmentation targeted at a specific domain is therefore a difficult problem in current Chinese word segmentation research. This thesis focuses on Chinese word segmentation for domain literature and designs a word-formation measure and related algorithms suited to such literature. The main research contents and innovations are as follows:

1. Based on the distinctive word-formation features of domain literature, a new word-formation measure named Term Frequency Deviation (TFD) is defined, and an unsupervised word segmentation optimization algorithm based on this measure is designed to merge domain vocabulary that has been split into fragments.

2. Because TFD emphasizes the characteristics of domain word formation, it has limitations when used as the sole segmentation measure. Traditional segmentation measures, including Mutual Information, are therefore used to handle phrase collocations outside the domain vocabulary, and a combined correction algorithm across the measures is designed to improve the overall optimization effect.

3. To address the scarcity of labeled corpora for model training in domain literature, a neural-network-based parameter transfer learning method is proposed. The method pre-trains an initial model on an annotated general-domain corpus, then applies a model parameter transfer strategy designed for this purpose to obtain a word segmentation model for the professional domain.

The method proposed in this thesis optimizes existing word segmentation results, incorporates other traditional segmentation measures in the process, and reuses knowledge from the general corpus through transfer learning. Extensive experiments on a domain-specific corpus of agricultural documents show that TFD, the segmentation optimization algorithm, and the combined correction algorithm significantly improve results compared with traditional segmentation tools. The proposed parameter transfer learning method yields a slight increase in F1-value on the target domain literature.
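
The abstract names Mutual Information as a traditional measure used alongside TFD but does not give formulas or thresholds. As an illustration only, the following minimal Python sketch shows how pointwise mutual information over adjacent tokens could be used to merge fragments produced by a general-purpose segmenter. It is not the thesis's TFD algorithm or its combined correction algorithm. The function names, the thresholds `min_pmi` and `min_count`, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Compute pointwise mutual information (base 2) for adjacent token pairs.

    `sentences` is a list of token lists produced by an existing segmenter.
    Returns (pmi_by_pair, bigram_counts).
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table, bigrams


def merge_pairs(tokens, table, bigrams, min_pmi=1.0, min_count=2):
    """Greedy left-to-right merge of adjacent tokens that pass both thresholds.

    The thresholds are hypothetical values chosen for the toy data below.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = (tokens[i], tokens[i + 1])
            frequent = bigrams.get(pair, 0) >= min_count
            cohesive = table.get(pair, float("-inf")) >= min_pmi
            if frequent and cohesive:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged


# Toy corpus (invented): the domain term "赤霉病" was split into "赤霉" + "病".
corpus = [
    ["小麦", "赤霉", "病", "防治"],
    ["赤霉", "病", "危害", "小麦"],
    ["水稻", "病", "虫", "防治"],
    ["小麦", "产量", "下降"],
]
table, bigrams = pmi_table(corpus)
result = merge_pairs(corpus[0], table, bigrams)
print(result)  # ['小麦', '赤霉病', '防治']
```

The count threshold is included because a pair seen only once can have a high score by chance, so a co-occurrence score alone tends to over-merge on small corpora. A real system would tune both thresholds on held-out domain text.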
Keywords/Search Tags: Chinese word segmentation, Domain adaptation, Term Frequency Deviation, Transfer learning, Long short-term memory neural network