The Campus Network Core Search Engine Technology - Chinese Word Segmentation

Posted on: 2007-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Ma
Full Text: PDF
GTID: 2208360215489583
Subject: Computer application technology
Abstract/Summary:
With the development of computer and network technology, campus networks have grown rapidly as platforms for sharing and exchanging information. The growing volume of online information on a campus network, together with its distributed storage, makes it difficult for users to retrieve what they need, so a large share of the available information resources goes unused. A campus network search engine is a system that gathers the information resources on the campus network and lets users query them. It consists of a spider, Chinese word segmentation, indexing, and retrieval. This thesis is a subtopic of the campus network search engine project. Its purpose is to provide an efficient Chinese word segmentation software package for the campus network search engine. To achieve this goal, the structural model of the Chinese word segmentation subsystem and the data interfaces between its modules were established. Then, through a study of the lexicon structure, the segmentation algorithm, and the identification of unknown words, a complete Chinese word segmentation solution for the campus network search engine was developed. The solution is based on mechanical (dictionary-based) word segmentation and includes the construction and expansion of a reverse lexicon, a two-layer index structure based on character-by-character binary search, a rule-based statistical algorithm for identifying unknown words, and an improved reverse maximum matching algorithm. Finally, the Chinese word segmentation system and software package were implemented, and their speed and memory usage were tested. The lexicon occupies 4.28 MB of memory, and the segmentation speed is 11 KB/s. According to the experiments, the package meets the needs of the current campus network search engine. The Chinese word segmentation subsystem described in the thesis is implemented with JDK 1.4 and Oracle9i.
Keywords/Search Tags: Campus Network Search Engine, Chinese Word Segmentation, Lexicon Mechanism, Maximum Matching Segmentation Algorithm
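
The reverse maximum matching and two-layer lexicon index named in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual data structures or its improvements to the algorithm, so everything here is an assumption made for illustration: the class name `ReverseMaxMatch`, the choice to key the first index layer by each word's last character, and the fallback to single characters when no lexicon word matches. It is also written in modern Java (generics and lambdas) rather than the JDK 1.4 used in the thesis, and it does not attempt unknown-word identification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: plain reverse maximum matching over a two-layer lexicon index.
final class ReverseMaxMatch {
    // First layer: sorted array of the last characters of all lexicon words.
    private final char[] keys;
    // Second layer: for each key, a sorted array of the words ending in that character.
    private final String[][] buckets;
    // Length of the longest lexicon word; bounds the matching window.
    private final int maxWordLength;

    ReverseMaxMatch(List<String> words) {
        // TreeMap keeps the last-character keys in ascending order.
        Map<Character, List<String>> grouped = new TreeMap<>();
        int longest = 1;
        for (String w : words) {
            if (w.isEmpty()) continue;
            char last = w.charAt(w.length() - 1);
            grouped.computeIfAbsent(last, k -> new ArrayList<>()).add(w);
            longest = Math.max(longest, w.length());
        }
        keys = new char[grouped.size()];
        buckets = new String[grouped.size()][];
        int i = 0;
        for (Map.Entry<Character, List<String>> e : grouped.entrySet()) {
            keys[i] = e.getKey();
            String[] bucket = e.getValue().toArray(new String[0]);
            Arrays.sort(bucket); // sorted so the second layer can be binary searched
            buckets[i] = bucket;
            i++;
        }
        maxWordLength = longest;
    }

    // Two binary searches: one over the last characters, one inside the matching bucket.
    boolean contains(String word) {
        if (word.isEmpty()) return false;
        int k = Arrays.binarySearch(keys, word.charAt(word.length() - 1));
        return k >= 0 && Arrays.binarySearch(buckets[k], word) >= 0;
    }

    // Scan from the end of the text, always taking the longest lexicon word
    // that ends at the current position; fall back to a single character.
    List<String> segment(String text) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxWordLength, end);
            while (len > 1 && !contains(text.substring(end - len, end))) {
                len--;
            }
            out.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(out); // words were collected right to left
        return out;
    }
}
```

With a lexicon containing 研究, 研究生, 生命, and 起源, calling `segment` on 研究生命起源 yields 研究 / 生命 / 起源. A forward maximum matching pass over the same input would first take 研究生 and leave 命 stranded, which is the usual motivation for scanning in reverse. Keying the first layer by the last character fits the reverse scan, because every candidate word at a given position ends in the same character.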