Research On Chinese Word Segmentation For Domain Literature

Posted on:2019-01-11

Degree:Master

Type:Thesis

Country:China

Candidate:H H Sun

Full Text:PDF

GTID:2428330578972714

Subject:Computer application technology

Abstract/Summary:

Chinese word segmentation is a fundamental technology in Chinese information processing. As applications of Chinese information processing have developed, the volume of data produced in various fields has grown steadily, and the demand for Chinese word segmentation in professional domains is expanding. However, most current training corpora for word segmentation are drawn from general news text. Cross-domain word segmentation therefore often performs poorly, because the training corpus and the text to be segmented differ greatly in word formation characteristics and distribution. Segmentation for a specific domain has thus become a difficult problem in current Chinese word segmentation research. This thesis focuses on Chinese word segmentation for domain literature, and designs a word formation measure and related algorithms suited to domain literature. The main research contents and innovations are as follows:

1. Based on the special word formation features of domain literature, a new word formation measure named Term Frequency Deviation (TFD) is defined, and an unsupervised word segmentation optimization algorithm based on this measure is designed to merge domain vocabulary that has been split into fragments.

2. TFD places special emphasis on the formation characteristics of domain words, so it has some limitations when used alone as a segmentation measure. In this thesis, traditional segmentation measures, including Mutual Information, are used to assist with phrase collocations outside the domain vocabulary, and a combined correction algorithm across the measures is designed to improve the overall segmentation optimization effect.

3. To address the scarcity of labeled corpora for model training in domain literature, a neural-network-based parameter transfer learning method is proposed. The method uses an annotated general-domain corpus to pre-train an initial model; a model parameter transfer strategy is then designed, and finally a word segmentation model for the professional domain is obtained.

The Chinese word segmentation method for domain literature proposed in this thesis optimizes existing word segmentation results, incorporates other traditional word segmentation measures in the process, and reuses knowledge from the general corpus through transfer learning. Extensive experiments on a domain-specific corpus of agricultural documents show that TFD, the word segmentation optimization algorithm, and the combined correction algorithm designed in this thesis significantly improve results compared with traditional segmentation tools. The proposed parameter transfer learning method yields a slight increase in F1-value on the target domain literature.
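
The abstract names Mutual Information as one of the traditional measures used alongside TFD but does not give a formula. As a rough illustration only, the sketch below computes pointwise mutual information for adjacent tokens in already-segmented text and greedily merges pairs that score above a threshold. The function names `pmi_scores` and `merge_by_pmi`, the `threshold` parameter, and the greedy left-to-right merge are assumptions made for this example. They are not the thesis's algorithm, and TFD itself is not reproduced here because the abstract does not define it.

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Pointwise mutual information for each adjacent token pair.

    `sentences` is a list of token lists, e.g. the output of an
    existing segmenter. All names here are hypothetical.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_by_pmi(tokens, scores, threshold):
    """Greedily merge adjacent tokens whose PMI is >= threshold.

    This is an assumed, simplified merging rule for illustration.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

In practice the threshold would need tuning on the target domain, and unseen pairs are never merged in this sketch because they are given a score of negative infinity.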

Keywords/Search Tags:

Chinese word segmentation, Domain adaptation, Term Frequency Deviation, Transfer learning, Long short-term memory neural network

Related items

1	Research On Chinese Word Segmentation Method Based On Two-way Long And Short-term Memory Model
2	Chinese Word Segmentation Analysis Based On Bidirectional LSTMN Recurrent Neural Network
3	Research On Chinese Word Segmentation Based On Neural Network
4	Research On Chinese Word Segmentation Based On Deep Learning
5	Research On Chinese Word Segmentation Based On Deep Learning
6	Applied Study On Chinese Word Segmentation Based On Deep Learning
7	Research And Application Of The Short-term Memory Network For Adjusting Gate Length
8	Research On Domain Adaptation For Chinese Word Segmentation Based On Parameter Transfer Learning
9	Research On Shared Bicycle Stock Prediction Based On Long-term And Short-term Memory Neural Network
10	Research Of Chinese Word Segmentation Oriented To Special Domain