Research On Word Segmentation Based On Probabilistic Model Of Dynamic Lexicon

Posted on: 2020-11-08, Degree: Master, Type: Thesis
Country: China, Candidate: K K Li, Full Text: PDF
GTID: 2438330596497515, Subject: Electronic and communication engineering
Abstract/Summary:
Chinese word segmentation divides a continuous character sequence into a sequence of meaningful words according to a given specification. As a basic research topic in natural language processing, it is widely used in search engines, machine translation, speech recognition and other applications, so the study of Chinese word segmentation algorithms has both theoretical and practical significance.

To solve the word segmentation problem for upper-level applications, this thesis combines the dictionary-based segmentation method with the statistical segmentation method and proposes a probabilistic-model word segmentation algorithm based on a dynamic lexicon. Firstly, the algorithm uses an unregistered (out-of-vocabulary) word recognition method based on degree of freedom and degree of cohesion to extract unregistered words from the corpus and add them to an improved dictionary structure, which serves as the dynamic lexicon. Secondly, the pre-processed text is preliminarily segmented with a segmentation algorithm based on reverse maximum matching, the probability of each segmentation result is calculated with a naive Bayes model, and the result with the largest probability is selected, which resolves segmentation ambiguity. Finally, a hidden Markov word segmentation model is designed to make up for the deficiencies of the naive Bayes model in word segmentation.

The main research work of this thesis is as follows.

First, for the problem of unregistered words in Chinese word segmentation, the shortcomings of the traditional word-frequency-based segmentation algorithm in handling unregistered words are analyzed, and an unregistered word recognition algorithm based on degree of cohesion and degree of freedom is proposed.

Second, the dictionary structure is improved. An analysis of the search space and time consumption of the whole-word binary-search dictionary structure shows that, once the discovered unregistered words are added to the dictionary, query time increases because lookups in the dictionary body involve invalid traversal. This thesis therefore proposes a dictionary structure based on a full binary tree, in which the full binary tree replaces the index layer and the dictionary body of the traditional whole-word binary-search structure. Experiments show that this effectively reduces segmentation time.

Third, for the problem of segmentation ambiguity, after a comprehensive analysis of traditional ambiguity resolution algorithms, this thesis proposes a probabilistic-model word segmentation algorithm built on the dynamic lexicon. The text to be segmented is first given an initial segmentation with the dictionary-based reverse maximum matching method, chosen for its segmentation precision. A directed acyclic graph then represents the paths of all possible segmentation results, the probability of each segmentation path is calculated with the naive Bayes model, and the path with the highest probability is chosen, which completes the ambiguity resolution. After analyzing the problems the naive Bayes model faces in word segmentation, a hidden Markov model is designed.

Based on the above theories and improved algorithms, this thesis designs and implements a Chinese word segmentation system in Java that integrates the main research content and innovations of the thesis and can efficiently segment Chinese sentences in practical use...
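
The ambiguity resolution step described above (reverse maximum matching for an initial segmentation, then a directed acyclic graph of candidate words scored with a naive Bayes style model) can be illustrated with a minimal Python sketch. This is not the thesis's Java implementation. The toy lexicon, its counts, the add-one smoothing, and the function names `reverse_max_match`, `build_dag` and `best_path` are all assumptions made for illustration, and the "naive Bayes" score here is simply the product of independent word probabilities.

```python
import math

# Hypothetical toy lexicon with made-up counts (not taken from the thesis).
LEXICON = {"研究": 5, "研究生": 2, "生命": 4, "命": 1, "的": 10, "起源": 3, "生": 1}
TOTAL = sum(LEXICON.values())
MAX_LEN = max(len(w) for w in LEXICON)


def word_logprob(word):
    # Assumed add-one smoothing: words outside the lexicon get a small probability.
    return math.log((LEXICON.get(word, 0) + 1) / (TOTAL + len(LEXICON) + 1))


def reverse_max_match(text):
    """Scan right to left, taking the longest lexicon word that ends at `end`."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(MAX_LEN, end), 0, -1):
            cand = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or cand in LEXICON:
                words.append(cand)
                end -= size
                break
    return words[::-1]


def build_dag(text):
    """dag[i] lists every end index j such that text[i:j] is a candidate word."""
    dag = {}
    n = len(text)
    for i in range(n):
        ends = [j for j in range(i + 1, min(n, i + MAX_LEN) + 1)
                if text[i:j] in LEXICON]
        if i + 1 not in ends:
            ends.insert(0, i + 1)  # single-character fallback edge
        dag[i] = ends
    return dag


def best_path(text):
    """Pick the DAG path whose words have the highest joint log-probability."""
    dag, n = build_dag(text), len(text)
    best = {n: (0.0, n)}  # position -> (best score to the end, next cut)
    for i in range(n - 1, -1, -1):
        best[i] = max((word_logprob(text[i:j]) + best[j][0], j) for j in dag[i])
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


if __name__ == "__main__":
    sentence = "研究生命的起源"
    print(reverse_max_match(sentence))
    print(best_path(sentence))
```

With these made-up counts, both functions split the sample sentence into 研究 / 生命 / 的 / 起源, whereas the competing path that begins with 研究生 scores lower. In a dynamic-lexicon setting the counts would be updated as newly recognized words are added, which this sketch does not attempt to model.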
Keywords/Search Tags: Dynamic lexicon, Naive Bayes, Hidden Markov model, Chinese word segmentation system