Font Size: a A A

Research And Implementation Of Real-time Compressed Text Index Technology

Posted on:2015-05-16Degree:MasterType:Thesis
Country:ChinaCandidate:W LuFull Text:PDF
GTID:2298330467963282Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of Internet, the amount of information is much larger, which brings great challenge to information retrieval. Full-text index technology is the key technology in the field of information retrieval such as search engine, information filtering and so on. Full-text index is a data structure built on large text, and it can be used to search any pattern in the original text efficiently.The traditional full-text index technology firstly constructs index from text, and then employs both the index and the text to search for patterns. The size of full-text index is about4to20times of the original text, resulting in huge space consumption. A recent trend is to develop compressed full-text self-index, which employs only the index to search for patterns. Compressed full-text self-index is a kind of self-index technology which does not need to store the original text. And the original text can be reconstructed from the index. In some cases, the index space consumes less than50%of the original text, which saves a lot of space, since the compressed full-text self-index technology achieves a good balance of time and space. In addition, the compressed full-text self-index technology deals with binary data directly, and the process of creating index is semantic irrelevantly, without the need for word segmentation, thus avoiding influence from natural language word segmentation.The main contribution of the thesis is summarized as follows:(1) This thesis mainly summarized the typical algorithms about compressed full-text self-index technology, and then evaluated the algorithms on various data-sets and justified the efficiency of this technology. (2) In order to support fuzzy search function, this thesis researched and implemented wildcard search, edit distance search and regular expression search on the base of compressed full-text self-index technology, which carried on the function extension to text index technology.(3) A high-performance text index system was designed and implemented. This system employed parallel compressed full-text self-index algorithm RLCSA, which helped save space and avoid influence from natural language word segmentation. Additionally, applied on high-performance processors with many-core CPU, this system can execute in a parallel multi-thread way, which increased processing speed. The entire system was Web-based, so that it can be run in different operating systems. The system realized the real-time index of real-time update data such as social network and so on.
Keywords/Search Tags:full-text index, self-index, parallel-computing, datacompression, fuzzy search
PDF Full Text Request
Related items