
Research And Implementation Of Chinese Word Segmentation Algorithm

Posted on: 2017-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: Z Qin
GTID: 2308330485992507
Subject: Software engineering
Abstract/Summary:
This paper studies a basic problem of natural language processing: Chinese word segmentation. Building on the common dictionary-based and statistics-based word segmentation algorithms, we propose a method that combines the two. The method makes full use of the high efficiency of dictionary-based segmentation and the strong ambiguity-handling ability of statistics-based segmentation.

First, mechanical bidirectional matching is used to judge whether a sentence is ambiguous. If there is no ambiguity, the segmentation result is passed directly as input to the Chinese name recognition module. If the sentence contains ambiguity, it is segmented with statistical methods. The forward full segmentation algorithm is applied to the sentence to obtain all possible segmentations. The trained bigram language model is then used to calculate the probability of each segmentation, and the three most probable segmentations are added to the candidate set. Next, an evaluation algorithm based on the hidden Markov model (HMM) evaluates the probability of each of the three candidates, and the one with the maximum probability is selected as the segmentation result. Finally, this result is input to the Chinese name recognition module, which uses a recognition algorithm that combines rules and statistics. The output of the name recognition module is the final processing result.

In practice, only a small portion of Chinese sentences contain ambiguity. This means that the majority of sentences can be handled by the bidirectional matching algorithm alone, and only a small portion need statistical word segmentation to eliminate the ambiguity. The method therefore takes both efficiency and accuracy into account, and the experimental results show better segmentation results.

The innovation of this paper is an improvement of the traditional whole-word binary-search dictionary and the double-character hash dictionary through the introduction of a word length array. In the dictionary body, entries are stored in separate parts according to their length and kept sorted, which improves dictionary matching efficiency and reduces the space occupied. An end-word array is introduced so that the same dictionary can be reused. A three-layer storage structure is used to store the bigram language model, which improves computing speed. The Chinese name recognition method, based on the combination of rules and statistics, shows a better recognition rate. Finally, we implement a Chinese word segmentation system with a convenient operation interface. The system integrates all of the dictionary structures and word segmentation methods, and supports adding and deleting dictionaries and other maintenance operations, to make operation and comparative study convenient.
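The ambiguity check described in the abstract, bidirectional matching, can be illustrated with a minimal sketch. It assumes that "mechanical bidirectional matching" means running forward and backward maximum matching against a dictionary and treating any disagreement between the two results as ambiguity. The function names, the plain Python set used as the dictionary, and the sample words are illustrative assumptions, not the thesis's implementation, which uses its own improved dictionary structures. The statistical stage (full segmentation, bigram scoring, HMM evaluation) is not shown.

```python
def forward_max_match(sentence, dictionary, max_len):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(sentence, dictionary, max_len):
    """Greedy right-to-left longest-match segmentation."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


def segment_or_flag(sentence, dictionary):
    """Return (forward segmentation, ambiguous?).

    A sentence is flagged as ambiguous when the forward and backward
    results differ (assumed criterion); such sentences would be handed
    to the statistical stage.
    """
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(sentence, dictionary, max_len)
    bwd = backward_max_match(sentence, dictionary, max_len)
    return fwd, fwd != bwd


if __name__ == "__main__":
    # Hypothetical toy dictionary for illustration only.
    toy_dict = {"研究", "研究生", "生命", "起源"}
    print(segment_or_flag("研究生命起源", toy_dict))
    print(segment_or_flag("生命起源", toy_dict))
```

With this toy dictionary, forward matching groups 研究生 first while backward matching groups 生命 first, so the first sentence is flagged for the statistical stage. The second sentence segments identically in both directions and would be passed straight to name recognition.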
Keywords: Chinese Word Segmentation, Dictionary-based Word Segmentation, Statistical Word Segmentation, Unknown Word Recognition, Language Model, Hidden Markov Model