
The Research Of Distributed Index Technology Based On Self-indexed Compressed Full-text

Posted on: 2016-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Liu
Full Text: PDF
GTID: 2308330467982279
Subject: Computer software and theory
Abstract/Summary:
Distributed full-text indexing, a core technology in information processing, is widely used in competitive intelligence (CI), information retrieval (IR), search engines (SE), information filtering (IF), and other fields. An in-depth study of campus distributed full-text index technology therefore has both theoretical and commercial value. With the growing popularity of the Internet, data of all kinds are generated ever faster, and the total volume is expanding exponentially. As the data grow, the corresponding index files keep growing as well. A traditional single-machine index system can no longer meet the requirements of indexing massive data, whereas a distributed index system can. The core technologies of a distributed index system are index creation, data distribution and load balancing across the distributed index, and index query. This thesis applies the compressed full-text index, a text processing technology that has become popular in recent years, to a distributed index system and discusses the query strategy under this index structure.

The contents and innovations of the distributed full-text index technology studied in this thesis are as follows:

(1) At present, most distributed index systems are built on the inverted index, whose query response time can reach the millisecond level on high-performance clusters. The inverted index, however, must store not only the index itself but also additional information that search engines need for functions such as extraction of stored segments, ranking, positional information, and query caching. As a result, storage space is used relatively inefficiently. As an original contribution, this thesis integrates the compressed full-text self-index, a hotspot of text index research, into a distributed index system and proposes a wavelet tree compression algorithm based on improved Huffman coding, combined with a suffix array. This adapts the compressed self-index structure to a distributed environment and provides a corresponding efficient construction algorithm.

(2) The index system of a search engine plays two roles: first, it creates an index of web files according to certain rules for later queries; second, it retrieves the indexed files that match a user's query, ranks them according to certain rules, and returns the results to the user. Based on the improved compressed self-index structure, a query strategy for the distributed environment is proposed.

(3) Building on the above and on related research results, a framework for a distributed full-text index system is presented. The system supports distributed indexing of unstructured data of different types and thus enables indexing and querying of massive unstructured data. The thesis also describes in detail the design of the system's index cluster and query cluster. Finally, the query-processing efficiency of the distributed index system is tested.
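
Item (1) names a wavelet tree built on Huffman coding as the compression structure of the self-index. The abstract does not describe the thesis's improved Huffman coding, so the following is only a minimal sketch of the general idea using plain Huffman codes. All names (`huffman_codes`, `_Node`, `HuffmanWaveletTree`) are hypothetical, and ordinary Python lists stand in for the succinct bitmaps a real compressed index would use. It supports the two basic operations, `access` (recover a character) and `rank` (count occurrences in a prefix), that self-indexes typically build on.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Map each symbol of `text` to a prefix-free bit string (plain Huffman)."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate one-symbol alphabet
        return {next(iter(freq)): "0"}
    tie = count()                            # tie-breaker so dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


class _Node:
    """One internal node: a bitmap plus prefix counts of its 1-bits."""

    def __init__(self, seq, codes, depth):
        self.bits = [int(codes[s][depth]) for s in seq]
        self.ones = [0]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        self.child = [None, None]            # subtree for bit 0 / bit 1
        self.leaf = [None, None]             # symbol if that branch is a leaf
        for b in (0, 1):
            sub = [s for s in seq if int(codes[s][depth]) == b]
            if not sub:
                continue
            if len(codes[sub[0]]) == depth + 1:   # code ends here: leaf
                self.leaf[b] = sub[0]
            else:
                self.child[b] = _Node(sub, codes, depth + 1)

    def map(self, b, i):
        """Translate position i in this node to a position in branch b."""
        return self.ones[i] if b else i - self.ones[i]


class HuffmanWaveletTree:
    """Wavelet tree whose shape follows the Huffman code of the text."""

    def __init__(self, text):
        if not text:
            raise ValueError("text must be non-empty")
        self.n = len(text)
        self.codes = huffman_codes(text)
        self.root = _Node(list(text), self.codes, 0)

    def access(self, i):
        """Return text[i] using only the bitmaps."""
        node = self.root
        while True:
            b = node.bits[i]
            i = node.map(b, i)
            if node.child[b] is None:
                return node.leaf[b]
            node = node.child[b]

    def rank(self, sym, i):
        """Count occurrences of `sym` in text[:i]."""
        code = self.codes.get(sym)
        if code is None:                     # symbol never occurs
            return 0
        node = self.root
        for ch in code:                      # follow the symbol's code path
            b = int(ch)
            i = node.map(b, i)
            node = node.child[b]             # None after the last bit
        return i
```

Because the tree follows the Huffman code, frequent symbols sit close to the root and contribute fewer bits in total than in a balanced wavelet tree. For example, in `HuffmanWaveletTree("abracadabra")` the symbol `a` gets a one-bit code, so its occurrences are recorded only in the root bitmap.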
Keywords/Search Tags: distributed full-text index, compressed self-indexed text, wavelet tree, suffix array, query strategy