Research On Word Segmentation Based On Probabilistic Model Of Dynamic Lexicon

Posted on:2020-11-08

Degree:Master

Type:Thesis

Country:China

Candidate:K K Li

Full Text:PDF

GTID:2438330596497515

Subject:Electronic and communication engineering

Abstract/Summary:

Chinese word segmentation divides a continuous sequence of characters into a sequence of meaningful words according to a given specification. As a basic task in natural language processing, it is widely used in search engines, machine translation, speech recognition, and other applications, so the study of Chinese word segmentation algorithms has both theoretical and practical significance. To address the segmentation problems that arise in upper-level applications, this thesis combines dictionary-based and statistical word segmentation methods and proposes a probabilistic-model word segmentation algorithm based on a dynamic lexicon. First, the algorithm uses an unregistered-word (out-of-vocabulary word) recognition method based on degree of freedom and degree of cohesion to extract unregistered words from the corpus and adds them to the improved dictionary structure, which serves as the dynamic lexicon. Second, the preprocessed text is preliminarily segmented with a segmentation algorithm based on reverse maximum matching, the probability of each segmentation result is calculated with the naive Bayes model, and the result with the largest probability is selected, which resolves segmentation ambiguity. Finally, a hidden Markov word segmentation model is designed to make up for the deficiencies of the naive Bayes model in word segmentation.

The main research work of this thesis is as follows.

First, to address unregistered words in Chinese word segmentation, the thesis analyzes the shortcomings of traditional word-frequency-based Chinese word segmentation algorithms in handling unregistered words and proposes an unregistered-word recognition algorithm based on degree of cohesion and degree of freedom.

Second, the dictionary structure is improved. An analysis of the search space and time consumption of the whole-word binary-search (whole-word dichotomy) dictionary structure shows that, as unregistered words are added to the dictionary, the query time also increases, because queries make invalid traversals of the dictionary body. The thesis therefore proposes a dictionary structure based on a full binary tree, which replaces the index layer and the dictionary body of the traditional whole-word binary-search structure; experiments show that it effectively reduces segmentation time.

Third, to address segmentation ambiguity, after a comprehensive analysis of traditional ambiguity resolution algorithms, the thesis proposes a probabilistic-model word segmentation algorithm built on the dynamic lexicon. The text to be segmented is first initially segmented with the dictionary-based reverse maximum matching method; a directed acyclic graph then represents the segmentation paths of all segmentation results; the probability of each segmentation path is calculated with the naive Bayes model; and the path with the highest probability is chosen, which completes ambiguity resolution. A hidden Markov model is designed after analyzing the problems the naive Bayes model faces in word segmentation.

Based on the above theory and improved algorithms, the thesis designs and implements a Chinese word segmentation system in Java that integrates the main research content and innovations of the thesis and can efficiently segment Chinese sentences in practical use...
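As a rough illustration of the third point, the Python sketch below contrasts dictionary-based reverse maximum matching with choosing the most probable path through the directed acyclic graph of candidate words. It is not the thesis's Java implementation: the function names, the toy word counts, the 4-character window, and the 0.5 pseudo-count for unknown single characters are all assumptions made for the example. The path probability is taken to be a plain product of independent word probabilities, which is only one common reading of a naive Bayes scoring rule.

```python
import math


def reverse_maximum_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon word that ends at each position."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    return words[::-1]


def best_path(text, freq, max_len=4):
    """Return the segmentation whose product of word probabilities is largest.

    Every lexicon word found in the text is an edge of a directed acyclic graph
    over character positions; dynamic programming from right to left finds the
    highest-scoring path. Log-probabilities are summed to avoid underflow.
    """
    total = sum(freq.values())
    n = len(text)
    # best[i] = (log-probability of the best segmentation of text[i:], next cut)
    best = [(-math.inf, n)] * n + [(0.0, n)]
    for start in range(n - 1, -1, -1):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif end == start + 1:
                count = 0.5  # assumed pseudo-count for an unknown single character
            else:
                continue  # unknown multi-character strings are not edges
            score = math.log(count / total) + best[end][0]
            if score > best[start][0]:
                best[start] = (score, end)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# Toy counts, invented for this example only.
freq = {"原子": 30, "结合": 40, "合成": 10, "成": 50, "分子": 30, "时": 40}
text = "原子结合成分子时"

print(reverse_maximum_match(text, freq))  # ['原子', '结', '合成', '分子', '时']
print(best_path(text, freq))              # ['原子', '结合', '成', '分子', '时']
```

With these toy counts, reverse maximum matching greedily picks 合成 and strands 结 as a lone character, while the probability-based path prefers 结合 followed by 成. This is the kind of ambiguity the abstract says the probabilistic step is meant to resolve.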

Keywords/Search Tags:

Dynamic lexicon, Naive Bayes, Hidden Markov model, Chinese word segmentation system

Related items

1	Study The Application And Research Of Hidden Markov Model In Chinese Geo-Entity
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	Chinese Word Segmentation System Design And Implementation
4	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
5	Research And Implementation Of Chinese Lexical Analysis Technology
6	A Research On Chinese Word Segmentation Based On Phonetic Annotation
7	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
8	The Effect Of Part Of Speech On Chinese Word Segmentation
9	Study On Disambiguation Algorithm For Chinese Word Segmentation
10	Chinese Text Data Classification