
Research On On-Line Indexing For Full-Text Retrieval

Posted on: 2011-09-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2178330338989605
Subject: Computer Science and Technology
Abstract/Summary:
With the growth of information on the Internet, more and more information is available, but it is difficult to obtain the newest information people need precisely and promptly. Index construction and maintenance are important components of a search engine, which must build indices over large volumes of Web information and update them in real time so that users can query for timely, precise and comprehensive information. This paper focuses on how to construct and manage indices in an on-line environment and how to balance indexing performance against search performance.

We start from inverted indexing, the main technology in full-text information retrieval, and introduce some key techniques for building and managing indices. Based on this research, our main contributions are as follows.

1. We study in depth index construction based on inverted files, along with several indexing algorithms. Given the requirements and context of on-line indexing, we design and implement an inverted index structure that supports constructing and updating indices efficiently.
2. We propose an index management algorithm called GPDID (Geometric Partition for Deleting Indexed Documents), based on a study of the characteristics of on-line index updating. Compared with traditional index construction and update algorithms, it introduces a threshold value for garbage collection. Experiments show that our method improves indexing performance when documents are deleted while keeping search performance high.
3. We propose an efficient index construction and management method that uses a dynamic Huffman-like tree. It dynamically adjusts the sequence of sub-index merge operations during index construction using a k-way method, and offers better query processing performance than previous methods.
Experiments show that the Huffman-like merge algorithm performs well in index construction and query processing, and provides an equivalent level of index maintenance performance when document insertions and deletions occur in parallel.

Based on the above research, we design and implement an experimental prototype of a full-text retrieval system. The system includes a document parsing module, an indexing module, a search subsystem, and index storage, and can serve as a basic platform for related research and experiments in information extraction.
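The abstract does not spell out how the Huffman-like tree orders sub-index merges, so the following is only a minimal sketch of the general idea, not the author's algorithm. It assumes that the cost of a merge equals the combined size of its inputs (in postings) and uses the classic optimal merge pattern: always merge the k smallest sub-indices first. The function name `plan_kway_merges` and the cost model are hypothetical and chosen for illustration.

```python
import heapq
from typing import List, Tuple


def plan_kway_merges(sizes: List[int], k: int = 2) -> Tuple[List[List[int]], int]:
    """Plan k-way merges of sub-indices, smallest first (Huffman-like).

    sizes: number of postings in each sub-index (hypothetical cost unit).
    Returns (steps, total_cost), where each step lists the sizes merged
    together and total_cost is the sum of all merged output sizes.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    heap = list(sizes)
    # Pad with zero-size dummy sub-indices so that every merge takes exactly
    # k inputs; this is the standard trick for k-ary Huffman trees.
    if len(heap) > 1:
        while (len(heap) - 1) % (k - 1) != 0:
            heap.append(0)
    heapq.heapify(heap)

    steps: List[List[int]] = []
    total_cost = 0
    while len(heap) > 1:
        # Take the k smallest sub-indices currently available.
        group = [heapq.heappop(heap) for _ in range(k)]
        merged = sum(group)
        # Hide zero-size entries (the padding dummies) from the reported plan.
        steps.append([s for s in group if s > 0])
        total_cost += merged
        # The merged sub-index competes with the rest in later rounds.
        heapq.heappush(heap, merged)
    return steps, total_cost


if __name__ == "__main__":
    steps, cost = plan_kway_merges([1, 2, 3, 4, 5], k=2)
    print(steps, cost)
```

In an on-line setting the heap would be updated as new sub-indices are flushed from memory, which is one way the merge sequence could be adjusted dynamically. How the thesis handles deletions within this schedule is not stated in the abstract.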
Keywords/Search Tags: information retrieval, on-line index, garbage collection, index performance, query performance