
Research and Implementation of Chinese Word Segmentation Techniques Based on the Conditional Random Field

Posted on: 2019-12-27 | Degree: Master | Type: Thesis
Country: China | Candidate: X F Xu | Full Text: PDF
GTID: 2428330566499245 | Subject: Electronic and communication engineering
Abstract/Summary:
With the development of artificial intelligence, robots are gradually entering people's daily lives, and natural language processing is widely used in human-computer interaction. Chinese word segmentation, a basic technology of natural language processing, is one of the active research topics in artificial intelligence. Current Chinese word segmentation algorithms perform poorly in specific domains, which leads to incorrect semantic understanding. This paper proposes an improved algorithm based on the conditional random field (CRF) model that improves the precision and recall of Chinese word segmentation. First, the paper introduces three mainstream word segmentation methods and, after comparing their advantages and disadvantages, selects CRF as its segmentation model. The overall flow of the segmentation system is then designed around the technical difficulties of Chinese word segmentation. Second, to address the lack of part-of-speech information in segmentation preprocessing, the paper proposes a part-of-speech and lexeme label set (PLLS) and introduces parameters to mark the part of speech. An improved feature template is proposed for the CRF: in addition to the common features, compound unary feature information is added to improve the recognition of out-of-vocabulary (OOV) words. Then, stochastic gradient descent (SGD) is applied to CRF training, and a method based on feature frequency is proposed to speed up the convergence of model training. To apply the model's prediction algorithm to PLLS, an improved Viterbi algorithm is proposed. In the post-processing stage, a reverse maximum matching (RMM) algorithm based on a trie is used to discover ambiguous words, and three disambiguation methods are proposed for the ambiguous words found. Finally, a Chinese word segmentation system is implemented in Java. For the practical application scenario, a government-affairs corpus is collected and constructed, the system is tested on it, and the test results are analyzed. Comparison with mainstream segmentation tools verifies the validity and practicability of the system.
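
The abstract does not show the thesis's code, so the following is only a minimal sketch of how reverse maximum matching over a trie might look in Java. The class name `RmmSegmenter`, the toy dictionary, and the single-character fallback for unmatched text are all assumptions made for illustration, not details taken from the thesis. The sketch stores each dictionary word reversed in the trie so that a right-to-left scan of the sentence can follow trie edges directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative reverse maximum matching (RMM) over a trie of reversed words. */
final class RmmSegmenter {
    /** Trie node; children are keyed by a single char (BMP characters only). */
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Inserts every word back to front so lookups can walk the text right to left. */
    RmmSegmenter(List<String> dictionary) {
        for (String word : dictionary) {
            Node node = root;
            for (int i = word.length() - 1; i >= 0; i--) {
                node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
            }
            node.isWord = true;
        }
    }

    /** Segments from the end of the text, taking the longest word ending at each position. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            Node node = root;
            int bestStart = end - 1; // assumed fallback: emit one character if nothing matches
            for (int i = end - 1; i >= 0; i--) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary word extends further to the left
                }
                if (node.isWord) {
                    bestStart = i; // remember the longest match seen so far
                }
            }
            words.add(text.substring(bestStart, end));
            end = bestStart;
        }
        Collections.reverse(words); // words were collected from right to left
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary, chosen to contain an overlapping ambiguity.
        RmmSegmenter segmenter =
                new RmmSegmenter(List.of("研究", "研究生", "生命", "命", "的", "起源"));
        System.out.println(segmenter.segment("研究生命的起源"));
    }
}
```

With this toy dictionary the sketch prints `[研究, 生命, 的, 起源]`, whereas a left-to-right maximum match would start with 研究生. A disagreement of that kind is one common signal of an overlapping ambiguity; whether the thesis detects ambiguous words this way is not stated in the abstract.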
Keywords/Search Tags:Chinese Word Segmentation, Conditional Random Field, Stochastic Gradient Descent method, Viterbi algorithm, Out Of Vocabulary Words, Ambiguous Words