Design And Implementation Of Chinese Words Auto-Segmentation Module In DRIS

Posted on: 2008-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: H Xiang
Full Text: PDF
GTID: 2178360272467359
Subject: Control theory and control engineering
Abstract/Summary:
As a core technology of information retrieval, Chinese word auto-segmentation uses computer programs to identify Chinese words automatically. Its output directly affects the results returned by information retrieval systems and search engines. The main purpose of this thesis is to investigate and design a Chinese-English word auto-segmentation module for a digital library system, a retrieval system based on Domain Resource Integration System (DRIS) theory.

This thesis introduces the background, significance, and content of the Chinese word auto-segmentation module in DRIS, along with the current state of Chinese word auto-segmentation technology. It summarizes the types of Chinese automatic segmentation algorithms and discusses the principles, advantages, and disadvantages of four dictionary-based Chinese word segmentation algorithms. It summarizes the performance evaluation criteria for Chinese word auto-segmentation systems and analyzes five difficult problems in the study of the technology. It studies the principles, functions, and organization of DRIS and of the search engine based on Lucene.Net. After describing four common Chinese word auto-segmentation algorithms and introducing the Chinese dictionary, this thesis presents and analyzes the segmentation results of these four algorithms. Taking the actual needs of DRIS into account, the dictionary-based forward maximum match method (FMM) is selected as the algorithm for the Chinese word auto-segmentation module in DRIS. After examining the main functions and structure of the language analyzer package Lucene.Net.Analysis, this thesis designs and implements the Chinese word auto-segmentation module Lucene.Net.Analysis.CJK2. It describes the file structure of this module, the initialization of the Chinese dictionary, and the handling of mixed Chinese and English text. It also gives the main programs and process flowcharts of this module, and presents the standard Token results indexed by DRIS using this module. Finally, this thesis analyzes the index merging problem in DRIS, proposes corresponding solutions, and outlines follow-up research.

With FMM, DRIS segments Chinese text efficiently, which can improve indexing efficiency and the quality of the retrieval service.
Keywords/Search Tags: Chinese words auto-segmentation, search engine, forward maximum match method, Chinese dictionary