Maximum Matching Chinese Word Segmentation Technology Based On Word Classification And Sorting

Posted on: 2021-01-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhao
Full Text: PDF
GTID: 2428330623971023
Subject: Computer application technology
Abstract/Summary:
With the rapid development of science and technology, natural language processing (NLP) has been widely applied in everyday work and study. As the foundation of NLP, Chinese word segmentation is used in many application areas, such as intelligent question answering systems, search engines, text retrieval, machine translation, and speech recognition. In most NLP application systems the first step is Chinese word segmentation, and its accuracy and efficiency directly affect the quality of all subsequent processing. Only by solving the word segmentation problem well can problems at the sentence and document level be better understood. Mature Chinese word segmentation technology is therefore an important prerequisite for the wider application of NLP, and in-depth research on it has both scientific significance and practical value.

This thesis analyzes the state of research on existing Chinese word segmentation methods and summarizes their advantages, their disadvantages, and the problems that remain open. To improve both the accuracy and the efficiency of segmentation, it proposes a maximum matching Chinese word segmentation method based on word classification and sorting.

First, to improve efficiency, a new structure for the segmentation dictionary is designed. The dictionary adopts the idea of grouping: words with the same first character and the same length are placed in one group, and the words in each group are sorted. Each matching step only needs to search the corresponding group, which greatly reduces the search range and improves search efficiency.

Second, to address the shortcomings of the maximum matching algorithm, the newly designed dictionary is used to improve it. The improved algorithm does not need a maximum word length to be set in advance. When a match fails, it can reduce the candidate length in jumps, according to the word lengths present in the dictionary, and the search scope during matching is greatly reduced. The method thus improves segmentation efficiency in several respects.

Third, to improve accuracy, the segmentation ambiguities and unregistered (out-of-vocabulary) words that arise during segmentation are handled. Ambiguities are resolved with ambiguity processing rules combined with word statistics. Unregistered words are handled with named entity recognition, and the newly recognized words are added to the dictionary. All of these steps are then integrated into a new Chinese word segmentation process.

Finally, comparative experiments were designed to test the accuracy and the efficiency of segmentation and to verify the effectiveness of the algorithm. The experimental results show that the improved maximum matching algorithm is significantly faster, and that combining it with named entity recognition also achieves good segmentation accuracy. In addition, a Chinese word segmentation system was designed and implemented. It provides external interfaces that other systems can call, so it can serve as a support system for higher-level natural language processing systems.
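
The grouped dictionary and the improved matching loop described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names (`build_dictionary`, `in_group`, `segment`), the binary search within each group, and the sample word list are all assumptions made for illustration, and ambiguity handling and named entity recognition are left out.

```python
from bisect import bisect_left
from collections import defaultdict


def build_dictionary(words):
    """Group words by (first character, length) and sort each group.

    Hypothetical helper: the abstract describes the grouping idea but
    not a concrete data layout.
    """
    groups = defaultdict(list)
    for word in set(words):
        if word:
            groups[(word[0], len(word))].append(word)
    for group in groups.values():
        group.sort()
    # For each first character, record which word lengths exist, longest first.
    lengths = defaultdict(list)
    for first, length in groups:
        lengths[first].append(length)
    for first in lengths:
        lengths[first].sort(reverse=True)
    return dict(groups), dict(lengths)


def in_group(group, candidate):
    """Binary search for candidate in one sorted group (an assumed lookup)."""
    idx = bisect_left(group, candidate)
    return idx < len(group) and group[idx] == candidate


def segment(text, groups, lengths):
    """Forward maximum matching without a preset maximum word length."""
    tokens = []
    i = 0
    while i < len(text):
        first = text[i]
        match = first  # fall back to a single character if nothing matches
        # Only try lengths that actually occur for this first character,
        # so the candidate length drops in jumps rather than one by one.
        for length in lengths.get(first, ()):
            if i + length > len(text):
                continue
            candidate = text[i:i + length]
            if in_group(groups[(first, length)], candidate):
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Illustrative word list, not taken from the thesis.
groups, lengths = build_dictionary(["研究", "研究生", "生命", "起源", "的"])
print(segment("研究生命的起源", groups, lengths))
# → ['研究生', '命', '的', '起源']
```

The sample output also shows why the abstract treats ambiguity as a separate step: plain forward maximum matching greedily takes "研究生" and leaves "命" stranded, whereas "研究 / 生命 / 的 / 起源" is the intended reading.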
Keywords/Search Tags:Natural language processing, Chinese word segmentation, Maximum match, Word classification and sorting