Design And Implementation Of Chinese Word Segmentation System Based On Grammar

Posted on: 2014-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y Peng
Full Text: PDF
GTID: 2268330401966895
Subject: Software engineering
Abstract/Summary:
With the rapid development of computer and information technology, information processing has become an indispensable part of modern society. In Chinese information processing, word segmentation splits a sequence of Chinese characters into separate words and so provides the basis for all later processing. The word is the smallest meaningful unit of the language, and further processing is possible only once the words have been identified. Automatic Chinese word segmentation is therefore the most important prerequisite for Chinese information processing, and it has long been an open research problem.

This thesis studies and analyzes four stages of automatic Chinese word segmentation: coarse segmentation, unknown word recognition, disambiguation, and part-of-speech (POS) tagging. On that basis it adopts a syntax-based Chinese word segmentation method, whose main ideas are as follows.

In the preprocessing stage, the shortest path method is combined with statistical segmentation: a statistics-based N-shortest path model produces the N best coarse segmentation results.

These results are then optimized further by combining dictionary-based segmentation with a hidden Markov model. The new dictionary follows a hierarchical design in which a temporary dictionary extends the basic dictionary. In the improved hidden Markov model, the Baum-Welch algorithm is used mainly to solve the parameter estimation problem, and an improved Viterbi algorithm is used to solve the sequence decoding problem.

For unknown word recognition, the thesis proposes a Chinese named entity recognition method based on multi-active agent theory. A first-layer statistics agent, a self-organizing hidden Markov model, performs preliminary named entity recognition. A second-layer rules agent then corrects the recognition results and eliminates ambiguity. A third-layer monitoring agent tracks the activity status values of all agents in the system to ensure negotiation and coordination among them.

Finally, POS tagging is performed with the classic Viterbi algorithm, and the final segmentation results are output.

A syntax-based Chinese word segmentation system was designed and implemented along these lines. The experimental results show that the system is good at identifying unknown words and at eliminating ambiguity through syntax-based analysis, and that it can offer new ideas for both the theory and the practice of Chinese word segmentation.
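
The coarse segmentation step can be illustrated with a minimal sketch. It is not the thesis implementation: the toy dictionary, its counts, the unknown-character penalty, and the function names are all hypothetical, and the sketch simply keeps the N lowest-cost paths through the word graph, with each word costing its negative log unigram probability.

```python
import heapq
import math

# Hypothetical toy dictionary (word -> count); not taken from the thesis.
TOY_COUNTS = {
    "研究": 5, "研究生": 3, "生命": 4, "命": 1, "生": 2,
    "起源": 3, "的": 10,
}
TOTAL = sum(TOY_COUNTS.values())
# Assumed penalty for a single character that is missing from the dictionary.
UNKNOWN_COST = 20.0


def word_cost(word):
    """Negative log unigram probability; lower cost means a likelier word."""
    count = TOY_COUNTS.get(word)
    if count is None:
        return UNKNOWN_COST
    return -math.log(count / TOTAL)


def n_shortest_segmentations(text, n=2, max_word_len=4):
    """Return up to n (cost, words) pairs, cheapest first.

    Character positions 0..len(text) are the nodes of a directed acyclic
    graph. An edge start -> end exists when text[start:end] is a dictionary
    word, or a single character (so every position stays reachable).
    """
    length = len(text)
    # best[i] holds up to n cheapest segmentations of text[:i].
    best = [[] for _ in range(length + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, length + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if len(word) > 1 and word not in TOY_COUNTS:
                continue  # multi-character strings must be dictionary words
            for cost, words in best[start]:
                candidates.append((cost + word_cost(word), words + [word]))
        best[end] = heapq.nsmallest(n, candidates, key=lambda c: c[0])
    return best[length]


if __name__ == "__main__":
    for cost, words in n_shortest_segmentations("研究生命的起源", n=2):
        print(round(cost, 3), "/".join(words))
```

With these toy counts, the two candidates kept for 研究生命的起源 are 研究/生命/的/起源 and 研究生/命/的/起源, in that order. Keeping more than one candidate is what lets later stages, such as unknown word recognition and disambiguation, overturn a coarse split that looked best on unigram statistics alone.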
Keywords/Search Tags: Chinese word segmentation, N-shortest path method, hidden Markov model, Chinese named entity