
Design And Implementation Of Chinese Word Segmentation Based On MMSEG Algorithm

Posted on: 2017-05-19    Degree: Master    Type: Thesis
Country: China    Candidate: Y Liu    Full Text: PDF
GTID: 2308330488477160    Subject: Software engineering
Abstract/Summary:
With the continued growth of the Internet, the volume of information online is increasing explosively. Data at this scale poses a great challenge for companies, both in storing information and in querying it. The core technology of a search engine is its word segmenter, and for Chinese search it is the Chinese word segmenter. Chinese word segmentation differs considerably from English word segmentation because of the characteristics of the Chinese language itself. How to effectively improve the accuracy of Chinese word segmentation is therefore an important problem to be solved. Against this background, this thesis studies Chinese word segmentation algorithms and gives an implementation of one.

This thesis focuses on an in-depth study of the MMSEG algorithm, in particular its complexity and its rules for handling ambiguous segmentations. On this basis, a Chinese word segmenter, MMSEGAnalyzer, is implemented in combination with the Lucene search framework. The main work is as follows:

First, the Lucene search framework is analyzed in depth, including its architecture and indexing technology. This analysis provides the methodology for designing MMSEGAnalyzer as a Chinese word segmenter for Lucene. Current Chinese word segmentation algorithms are also analyzed in depth, and ambiguous words, the main challenge currently facing Chinese word segmentation, are classified and analyzed.

Then, the MMSEG Chinese word segmentation algorithm is analyzed in detail, mainly in terms of its dictionary implementation, segmentation algorithm, and disambiguation rules. The dictionary implementations considered are mainly a dictionary structure based on whole-word binary search, a dictionary structure based on character-by-character binary search, and a dictionary structure based on a TRIE index tree. The MMSEG segmentation algorithm is divided into the simple maximum matching algorithm and the complex maximum matching algorithm.

Finally, the MMSEGAnalyzer Chinese word segmenter is designed and implemented in detail. Its implementation is divided into four modules: the dictionary management module, the segmentation module, the ambiguity processing module, and the Lucene interface management module. The dictionary management module is mainly responsible for storing, loading, and parsing the dictionary. This thesis implements it in three aspects: dictionary loading, dictionary extension, and automatic loading and parsing of the dictionary. The ambiguity processing module implements the four disambiguation rules of the MMSEG algorithm, which filter the candidates produced by complex maximum matching in the segmentation module. The Lucene interface management module gives Lucene access to MMSEGAnalyzer for word segmentation, realizing the integration with Lucene.

The result is MMSEGAnalyzer, a Chinese word segmenter based on the MMSEG segmentation algorithm. Used through Lucene, MMSEGAnalyzer handles Chinese word segmentation scenarios well and greatly improves the accuracy of Chinese word segmentation.
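
As a rough illustration of the ideas summarized above, the following Python sketch shows a TRIE dictionary, simple maximum matching, and complex maximum matching with four disambiguation rules. It is not the thesis's MMSEGAnalyzer, which targets Lucene. The rules follow the commonly described form of MMSEG: maximum total chunk length, largest average word length, smallest variance of word lengths, and largest sum of log frequencies of single-character words. The toy dictionary, its frequencies, and all function names are assumptions made for illustration only.

```python
import math
from fractions import Fraction

END = None  # terminal marker key in the trie; never collides with a character

# Hypothetical toy dictionary: word -> frequency (values invented for illustration).
TOY_DICT = {"研究": 50, "研究生": 20, "生命": 40, "命": 10, "起源": 30}


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_words(trie, text, start):
    """All candidate words at text[start]; the single character is always one."""
    found = [text[start]]
    node = trie
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node and i > start:
            found.append(text[start:i + 1])
    return found


def simple_max_match(text, freq):
    """Simple maximum matching: always take the longest word at each position."""
    trie = build_trie(freq)
    out, pos = [], 0
    while pos < len(text):
        word = max(match_words(trie, text, pos), key=len)
        out.append(word)
        pos += len(word)
    return out


def chunks(trie, text, start, depth=3):
    """Enumerate chunks of up to `depth` consecutive words starting at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in match_words(trie, text, start):
        for rest in chunks(trie, text, start + len(word), depth - 1):
            result.append([word] + rest)
    return result


def variance(chunk):
    """Exact variance of word lengths within a chunk."""
    mean = Fraction(sum(len(w) for w in chunk), len(chunk))
    return sum((len(w) - mean) ** 2 for w in chunk) / len(chunk)


def pick_chunk(candidates, freq):
    """Apply the four disambiguation rules in order until one chunk remains."""
    rules = [
        lambda c: sum(len(w) for w in c),                      # 1: max total length
        lambda c: Fraction(sum(len(w) for w in c), len(c)),    # 2: max average length
        lambda c: -variance(c),                                # 3: min variance
        lambda c: sum(math.log(freq.get(w, 1))                 # 4: max sum of log freq
                      for w in c if len(w) == 1),              #    of one-char words
    ]
    for score in rules:
        best = max(score(c) for c in candidates)
        candidates = [c for c in candidates if score(c) == best]
        if len(candidates) == 1:
            break
    return candidates[0]


def complex_max_match(text, freq):
    """Complex maximum matching: emit the first word of the best 3-word chunk."""
    trie = build_trie(freq)
    out, pos = [], 0
    while pos < len(text):
        best = pick_chunk(chunks(trie, text, pos), freq)
        out.append(best[0])
        pos += len(best[0])
    return out


print(simple_max_match("研究生命起源", TOY_DICT))   # ['研究生', '命', '起源']
print(complex_max_match("研究生命起源", TOY_DICT))  # ['研究', '生命', '起源']
```

On this toy input, simple maximum matching greedily takes the longest first word, while the complex variant prefers the chunk whose word lengths have smaller variance, which is how the disambiguation rules filter the candidates.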
Keywords/Search Tags: MMSEG algorithm, dictionary, Chinese word segmentation, word segmentation, Lucene, segmentation algorithm