
Research And Implementation Of An Open High-Performance Platform Of Full-Text Retrieval

Posted on: 2010-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Hong
GTID: 2178360278970523
Subject: Computer system architecture
Abstract/Summary:
The explosive growth of information has driven the rapid development of search engines. General-purpose search engines such as Google and Baidu have proved successful. However, their commercial technology is confidential, developers cannot seamlessly embed them in their own applications, and open-source search engines that support Chinese well are lacking. This thesis therefore researches and implements a new Chinese full-text retrieval platform. With high performance and flexibility, it is intended both for practical use in dynamic data environments and as a feasible platform for research and experimentation in information retrieval. The main research work and innovations of the thesis are as follows.

1. An improved method is presented to address the low performance and poor flexibility of the traditional MM (maximum matching) segmentation method. It uses a new dictionary structure based on a hash table and a trie, which increases word segmentation speed by 200%. Moreover, because it is not bound to a fixed maximum matching length, it is more flexible.

2. Because traditional index structures adapt poorly to dynamic data environments, a new index construction scheme is presented. It includes: (1) an improved inverted index structure with chained storage, which solves the problem of dynamically growing index data; (2) a novel index merging strategy based on a dynamic balanced tree; (3) a configurable memory allocation strategy based on a limited exponent method, which greatly improves the utilization and efficiency of index memory; (4) a differential compression algorithm based on d-gaps, which reduces the size of index files by 75% and indirectly reduces the number of I/O operations.

3. Based on the automatic word segmentation algorithm and index structures described above, and using object-oriented programming in C++ with several design patterns such as the factory pattern, we design and implement a high-performance Chinese indexing platform with a flexible, scalable architecture. Its subsystems and modules include the index subsystem, search subsystem, storage subsystem, plug-in management subsystem, and memory management module.

4. Finally, on top of the indexing platform, we develop a commercial search engine. It builds high-capacity indexes over all kinds of monitoring data that record users' Internet access behavior, and provides fast query services. More than half a year of practical use has demonstrated the efficiency of the full-text retrieval platform.
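To make the segmentation idea in point 1 concrete, the following is a minimal sketch of forward maximum matching over a trie whose child links are hash maps. It is written in Python for brevity (the thesis itself uses C++), and the names `TrieNode`, `Dictionary`, `longest_match`, and `segment` are hypothetical. The abstract does not describe the exact dictionary layout, so this should be read as one plausible arrangement rather than the thesis's implementation. Because the trie walk simply continues until no child matches, no fixed maximum word length has to be configured.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # hash map: character -> child TrieNode
        self.is_word = False  # True if the path from the root spells a word


class Dictionary:
    """Hypothetical hash-plus-trie dictionary (an assumption, not the thesis design)."""

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def longest_match(self, text, start):
        """Return the length of the longest word starting at text[start], or 0."""
        node = self.root
        best = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break  # no longer word can start here, so stop early
            if node.is_word:
                best = i - start + 1
        return best


def segment(text, dictionary):
    """Forward maximum matching: greedily take the longest word at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Characters not covered by the dictionary become single-character tokens.
        length = dictionary.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens
```

With a dictionary containing 研究, 研究生, 生命, and 起源, `segment("研究生命起源", ...)` greedily takes 研究生 first and leaves 命 as a single character, which illustrates both the single-pass lookup and the well-known ambiguity limitation of plain maximum matching.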
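Items (1) and (3) of point 2 can be illustrated together. The sketch below assumes that "chained storage" means each term's postings live in a chain of blocks, and that the "limited exponent method" means each new block is larger than the previous one by a constant factor, up to a configurable cap. Both readings are assumptions, since the abstract gives no details, and the names `PostingChain`, `InvertedIndex`, `first_capacity`, `growth`, and `max_capacity` are hypothetical.

```python
class PostingChain:
    """Postings for one term, stored as a chain of bounded blocks."""

    def __init__(self, first_capacity=4, growth=2, max_capacity=64):
        self.growth = growth
        self.max_capacity = max_capacity
        self.blocks = [[]]                  # chain of blocks, oldest first
        self.capacities = [first_capacity]  # capacity reserved for each block

    def append(self, doc_id):
        if len(self.blocks[-1]) == self.capacities[-1]:
            # The last block is full: chain a new block whose capacity grows
            # exponentially but never exceeds max_capacity.
            next_cap = min(self.capacities[-1] * self.growth, self.max_capacity)
            self.blocks.append([])
            self.capacities.append(next_cap)
        self.blocks[-1].append(doc_id)

    def __iter__(self):
        for block in self.blocks:
            yield from block


class InvertedIndex:
    """Maps each term to the chain of document IDs that contain it."""

    def __init__(self):
        self.terms = {}  # term -> PostingChain

    def add_document(self, doc_id, tokens):
        # Record each distinct term once per document.
        for term in set(tokens):
            if term not in self.terms:
                self.terms[term] = PostingChain()
            self.terms[term].append(doc_id)

    def postings(self, term):
        chain = self.terms.get(term)
        return list(chain) if chain is not None else []
```

Under these assumptions, new documents can be appended without rewriting existing postings, rare terms waste little memory in their small first block, and frequent terms quickly move to large blocks without any single allocation growing unboundedly.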
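The d-gap compression in item (4) of point 2 rests on a simple observation: a sorted posting list can be stored as the differences between consecutive document IDs, and those differences are usually small enough to fit in fewer bytes. The abstract does not say how the gaps are encoded, so the sketch below pairs d-gaps with variable-byte coding as an assumption. The function names are hypothetical, and the example does not attempt to reproduce the 75% figure reported in the thesis.

```python
def to_dgaps(doc_ids):
    """Convert strictly increasing, non-negative doc IDs into gaps."""
    gaps = []
    prev = 0
    for i, doc_id in enumerate(doc_ids):
        if doc_id < 0 or (i > 0 and doc_id <= prev):
            raise ValueError("doc IDs must be non-negative and strictly increasing")
        gaps.append(doc_id - prev)  # the first gap is the first ID itself
        prev = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the original doc IDs with a running sum."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 0x80:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # final byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers = []
    n = 0
    shift = 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers
```

For the posting list `[1000000, 1000003, 1000010, 1000500]`, the gaps are `[1000000, 3, 7, 490]`. Variable-byte coding the gaps takes 7 bytes, compared with 12 bytes for coding the raw IDs the same way, and smaller index files in turn mean fewer disk reads per query.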
Keywords/Search Tags: Full-text retrieval, Chinese segmentation, Inverted index, Index maintenance, Search engine