The Research And Application Of Chinese Word Segmentation Technology In Search Engine

Posted on: 2017-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: G Z Wei
Full Text: PDF
GTID: 2308330503959898
Subject: Computer Science and Technology
Abstract/Summary:
Chinese word segmentation is a key technology for computer analysis of Chinese text, so the quality of the segmentation algorithm directly affects the practicality of a Chinese text analysis system. Search engines are one of the important applications of Chinese word segmentation, and achieving higher precision in less time is a focus of current research. String matching is the most widely used approach to Chinese word segmentation, and the maximum matching method is its most commonly used algorithm.

By analyzing the disadvantages of the maximum matching algorithm and combining it with an efficient double-character hash dictionary mechanism, this paper proposes an improved forward maximum matching method based on a double-character hash dictionary structure grouped by word length, which clearly improves segmentation performance. The matching process is then also used for disambiguation, which reduces segmentation errors. Finally, following the ideas of the improved algorithm, the Chinese text analysis modules of Lucene were redesigned to optimize the search engine system. Tests showed that the improved method performs better than the maximum matching method.

The work in this paper is summarized as follows:

1. Based on a study of the maximum matching method, this paper analyzes three problems of the method and proposes a corresponding solution for each.

2. To address the disadvantages of the maximum matching algorithm, segmentation performance was improved. The author designed a double-character hash dictionary mechanism grouped by word length and, to meet the needs of the improved algorithm, proposed the improved forward maximum matching method built on that structure. The algorithm dynamically selects an appropriate initial matching position and length for each match and searches the dictionary quickly, which reduces unnecessary matching cost. As a result, both segmentation rate and accuracy improve over the traditional algorithm.

3. Building on the matching process of the improved algorithm, and combining the ideas of the maximum matching algorithm and the back word method, some crossing ambiguities were effectively eliminated, making the segmentation results more accurate.

4. After studying search engines and the Lucene development kit, a simple search engine system based on Lucene was built. Following the ideas of the improved algorithm, the Chinese text analysis modules of Lucene were redesigned, which optimized the performance of the Lucene-based search engine system.

5. The improved forward maximum matching method based on the double-character hash, word-length-grouped dictionary structure was evaluated experimentally. First, the same corpus was segmented with different dictionary mechanisms to test the performance of the double-character dictionary chosen in this paper. Then the same corpus was segmented with the improved algorithm and with the forward maximum matching method, and the results were compared. The experiments showed that the segmentation rate and precision of the algorithm proposed in this paper were better than those of the forward maximum matching method, so the intended improvement was achieved.
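
The abstract does not give the dictionary layout or the matching procedure in detail, so the following is only a minimal sketch of the general idea, under stated assumptions: words are indexed by their first two characters, each such bucket is grouped by word length, and forward matching tries only the lengths that actually exist for the current two-character prefix, longest first. The names `build_dictionary` and `segment` and the sample word list are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def build_dictionary(words):
    """Index words by their first two characters, then by word length.

    Assumed layout (not from the thesis): {prefix: {length: set_of_words}}.
    Single-character words are not stored; they are the matching fallback.
    """
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if len(word) >= 2:
            index[word[:2]][len(word)].add(word)
    return index


def segment(text, index):
    """Forward maximum matching over the two-character hash dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        groups = index.get(text[i:i + 2])  # one hash lookup per position
        match = None
        if groups:
            # Try only the word lengths present for this prefix, longest first.
            for length in sorted(groups, reverse=True):
                candidate = text[i:i + length]
                if len(candidate) == length and candidate in groups[length]:
                    match = candidate
                    break
        if match is None:
            match = text[i]  # no dictionary word starts here: emit one character
        tokens.append(match)
        i += len(match)
    return tokens


index = build_dictionary(["研究", "研究生", "生命", "起源"])
print(segment("研究生命起源", index))  # ['研究生', '命', '起源']
```

With this toy dictionary the sketch returns `['研究生', '命', '起源']` rather than the intended 研究 / 生命 / 起源. That is a crossing ambiguity of the kind plain forward maximum matching produces and the disambiguation step described above aims to reduce. The sketch does not implement that disambiguation step.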
Keywords/Search Tags: Chinese word segmentation, Search engine, Lucene, Forward maximum matching algorithm, Double hash structure, Ambiguity processing