
Chinese Word Segmentation Based On Maximum Entropy Method Of Effective Substrings

Posted on: 2019-01-29    Degree: Master    Type: Thesis
Country: China    Candidate: M X Jiang    Full Text: PDF
GTID: 2428330575453628    Subject: Statistics
Abstract/Summary:
Chinese word segmentation is the foundation of subsequent Chinese information processing and its applications, which makes it an important part of the field, and segmentation algorithms have received considerable research attention. Among the many existing algorithms, those based on dictionary matching and those based on word annotation are currently the main approaches. Dictionary-matching algorithms have low computational complexity, are relatively simple to apply, and are intuitive to understand. However, their precision drops sharply in the presence of ambiguous strings and out-of-vocabulary words, and both their accuracy and their speed are closely tied to the size of the dictionary. Word-annotation methods are grounded in statistical theory and treat segmentation as a machine learning problem. They are more complex, but they can capture word features and make good use of contextual information during segmentation, and they perform well on out-of-vocabulary words and ambiguities, so they have become the main approach to Chinese word segmentation at this stage. The Chinese word segmentation specification is still incomplete, which makes research on the problem considerably harder.

The maximum entropy method is a word-annotation method. It can handle very fine-grained features of Chinese vocabulary, is highly discriminative when segmenting, and is simple and easy to understand, so the maximum entropy model has been widely used in word segmentation. Chinese text, however, contains many meaningful, stable combination strings, and this combined-string information is lost when words are segmented with the maximum entropy method alone.

This thesis presents a Chinese word segmentation method based on maximum entropy with effective substrings, and introduces a procedure for extracting them. First, all substrings in the text are collected into an initial substring dictionary and their frequencies are counted. The initial dictionary is then matched against every string of the training corpus using full matching. Any substring that spans a segmentation marker in the training corpus is deleted from the dictionary, and the remaining substrings whose frequency exceeds a given threshold form the final effective substring dictionary. The training corpus is then segmented by matching against this dictionary and labeled, a maximum entropy model is trained on the labeled corpus, and the trained model is used to predict the test corpus. In the final experiments, the new method improves on the segmentation results of the plain maximum entropy method, which indicates that it is an effective Chinese word segmentation method.
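
The effective substring extraction described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and it rests on several assumptions that the abstract does not settle: substrings are taken from the gold-segmented training corpus, a substring "spans a segmentation marker" when a gold word boundary falls strictly inside one of its occurrences, and the later matching step uses forward maximum matching. The names `extract_effective_substrings`, `split_into_units`, `max_len`, and `threshold` are hypothetical.

```python
from collections import Counter


def extract_effective_substrings(segmented_sentences, max_len=4, threshold=1):
    """Build an effective substring dictionary from a gold-segmented corpus.

    segmented_sentences: list of sentences, each a list of gold words.
    Assumption: a substring is discarded if ANY of its occurrences has a
    gold word boundary strictly inside it.
    """
    counts = Counter()   # frequency of every candidate substring
    crossing = set()     # substrings seen spanning a gold boundary

    for words in segmented_sentences:
        text = "".join(words)
        # Character offsets of the gold boundaries inside the sentence.
        boundaries = set()
        pos = 0
        for word in words[:-1]:
            pos += len(word)
            boundaries.add(pos)

        n = len(text)
        for i in range(n):
            # Candidates have length 2..max_len (single characters are
            # always available as fallback units).
            for j in range(i + 2, min(i + max_len, n) + 1):
                sub = text[i:j]
                counts[sub] += 1
                if any(i < b < j for b in boundaries):
                    crossing.add(sub)

    # Keep substrings that never span a boundary and are frequent enough.
    return {s: c for s, c in counts.items()
            if s not in crossing and c > threshold}


def split_into_units(text, dictionary, max_len=4):
    """Split text into effective substrings and single characters.

    Assumption: forward maximum matching; the abstract does not name
    the matching strategy used for this step.
    """
    units = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                units.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary entry starts here: emit one character.
            units.append(text[i])
            i += 1
    return units


corpus = [["北京", "大学"], ["北京", "欢迎", "你"], ["大学", "生活"]]
effective = extract_effective_substrings(corpus, max_len=4, threshold=1)
units = split_into_units("北京大学生活", effective)
```

In this toy corpus only 北京 and 大学 occur more than once without ever crossing a gold boundary, so they are the only entries kept. The resulting units (substrings plus leftover single characters) are what would then be labeled and passed to a maximum entropy tagger.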
Keywords/Search Tags:Maximum Entropy Model, Chinese Word Segmentation, Effective Substring, Parameter Estimation, Feature Selection