
Research Of Index In Chinese Full-text Retrieval System

Posted on: 2008-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: H J Zhao
Full Text: PDF
GTID: 2178360212968332
Subject: Computer application technology
Abstract/Summary:
The Chinese full-text retrieval system is one of the fastest-developing fields in the information industry, and the core of such a system is the index device. This paper analyzes several algorithms for constructing the index device, compares the related technologies, and describes the advantages, disadvantages, and implementation difficulty of each. It then presents the data structure and a new algorithm model for the index in a full-text retrieval system.

The paper first summarizes the technologies related to index construction in Chinese full-text retrieval, mainly the data structures for document indexing, the selection of index units, and index compression algorithms. On this basis, it implements a complete index system using these technologies, such as character-based inverted lists and the variable-byte coding compression algorithm. The system has three functions: text preprocessing, index construction, and index updating.

The text preprocessing part separates Chinese characters, foreign-language characters, and special characters, and deletes stop words.

The index construction part first presents an algorithm based on traditional inverted lists, the sort-merge method. This algorithm needs temporary space 10 times the size of the source text. To solve this problem of oversized temporary space, the paper proposes a new index construction scheme. Its index organization is an improved inverted list, and its storage scheme mixes linked and sequential storage. It needs no extra temporary space and also improves the efficiency of index construction.
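To make the preprocessing and character-based inverted list concrete, here is a minimal Python sketch. It is not the thesis's implementation: the stop-character set `STOP_CHARS`, the helper names `is_cjk`, `tokenize`, and `build_index`, and the choice to index runs of ASCII letters and digits as whole words are all assumptions made for illustration. It builds the whole index in memory and does not model the sort-merge method or the mixed linked/sequential storage described above.

```python
from collections import defaultdict

# Hypothetical stop-character set; the thesis does not list its stop words.
STOP_CHARS = {"的", "了", "是"}


def is_cjk(ch):
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Yield (position, unit) pairs.

    Each Chinese character is one index unit; a run of ASCII letters or
    digits is one lowercased unit; everything else is a separator.
    """
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if is_cjk(ch):
            if ch not in STOP_CHARS:
                yield i, ch
            i += 1
        elif ch.isascii() and ch.isalnum():
            start = i
            while i < n and text[i].isascii() and text[i].isalnum():
                i += 1
            yield start, text[start:i].lower()
        else:
            i += 1


def build_index(docs):
    """Map each index unit to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, unit in tokenize(text):
            index[unit].append((doc_id, pos))
    return index


docs = ["全文检索的索引", "索引 Index 压缩"]
index = build_index(docs)
print(index["索"])
```

Because documents are processed in order, each posting list comes out sorted by document number and position, which is what makes gap-based compression of the lists possible.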
During index construction, variable-byte coding is used to compress the index; experiments indicate that this compression reduces the size of the index file by 20-30%. For index updating, the paper proposes three dynamic index updating strategies based on sequential storage and one dynamic index updating algorithm based on linked storage. The linked-storage updating algorithm achieves a complexity of O(n).
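As an illustration of variable-byte coding, the following Python sketch encodes a sorted posting list as gaps between successive values. The byte layout is an assumption: 7 data bits per byte, low-order group first, with the high bit set on every byte except the last of each number. The thesis may use a different convention, and the function names are hypothetical.

```python
def to_gaps(postings):
    """Turn a sorted list of integers into differences between neighbours."""
    gaps, prev = [], 0
    for p in postings:
        gaps.append(p - prev)
        prev = p
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild the original values by running sum."""
    values, total = [], 0
    for g in gaps:
        total += g
        values.append(total)
    return values


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order group first.

    The high bit marks "more bytes follow" (an assumed convention).
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte coding needs non-negative integers")
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            n |= (byte & 0x7F) << shift
            shift += 7
        else:
            numbers.append(n | (byte << shift))
            n, shift = 0, 0
    return numbers


postings = [3, 130, 131, 20000]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
print(len(encoded), decoded)
```

Small gaps fit in a single byte, so dense posting lists shrink the most. The saving on a real index depends on the gap distribution, and this sketch makes no attempt to reproduce the 20-30% figure reported above.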
Keywords/Search Tags: Inverted Lists, Chinese Full-Text Retrieval, Index Compression, Index Device