
Document ranking on weight-partitioned signature files

Posted on: 1999-10-09
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Ren, Liming
Full Text: PDF
GTID: 1468390014969866
Subject: Computer Science
Abstract/Summary:
A signature file organization for supporting document ranking is proposed. The weight-partitioned signature file employs multiple signature files, each corresponding to one term frequency, to represent terms with different term frequencies. Words with the same term frequency in a document are grouped together and hashed into the signature file corresponding to that term frequency. This eliminates the need to explicitly record the term frequency for each word.

We investigate the effect of false drops on retrieval effectiveness when they are not eliminated in the search process. We show that false drops cause only insignificant degradation in precision and recall when the false drop probability is below a certain threshold. This is an important result, since false drop elimination could become the bottleneck in systems that use fast signature file search techniques. We also perform an analytical study of the performance of the weight-partitioned signature file under different search strategies and configurations.

We propose several fast heuristic search algorithms to reduce the percentage of signatures that need to be searched. On average they achieve search reduction ratios in the range of 40% to 65%. We also try to improve the response time by reducing the total size of the signature files: we first establish a coarse ranking by searching the signature files, then perform exact text matching on the top-ranked documents obtained. For collections of long documents, we achieve the same precision and recall with 75% less storage overhead.

In the second part of this dissertation, we propose a new key-based partitioning method, variable-prefix partitioning, to improve signature file search speed. For the variable-prefix method, we obtain several fast and accurate analytical functions for estimating quantities such as the search space reduction ratio and the page fill factor.
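The weight-partitioned organization described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The signature width, the number of bits set per word, the frequency cap, the SHA-256-based hash, and all function names are assumptions chosen for illustration. The rule of scanning partitions from the highest frequency down and the scoring of a document as the sum of estimated term frequencies are likewise assumptions. The abstract does not specify any of these details.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple

SIG_BITS = 64      # signature width in bits (assumed value)
BITS_PER_WORD = 3  # bits set per word (assumed value)
MAX_TF = 4         # frequencies above this share the last partition (assumed)


def word_mask(word: str) -> int:
    """Return a mask with at most BITS_PER_WORD bits set for `word`."""
    mask = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return mask


def build_signatures(text: str) -> Dict[int, int]:
    """Build one superimposed signature per term frequency for a document.

    Words are grouped by how often they occur, and each group is hashed
    into the signature for that frequency, so no count is stored per word.
    """
    sigs = {tf: 0 for tf in range(1, MAX_TF + 1)}
    for word, count in Counter(text.lower().split()).items():
        sigs[min(count, MAX_TF)] |= word_mask(word)
    return sigs


def estimated_tf(sigs: Dict[int, int], word: str) -> int:
    """Return the highest partition whose signature contains the word's bits.

    A match may be a false drop, so the result can overestimate the true
    frequency; it never underestimates it (up to the MAX_TF cap).
    """
    mask = word_mask(word)
    for tf in range(MAX_TF, 0, -1):
        if sigs[tf] & mask == mask:
            return tf
    return 0


def rank(docs: Dict[str, str], query: str) -> List[Tuple[str, int]]:
    """Coarsely rank documents by the summed estimated term frequencies."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        sigs = build_signatures(text)
        score = sum(estimated_tf(sigs, term) for term in terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Because false drops are not eliminated here, the scores are only a coarse ranking. In the spirit of the abstract, the top-ranked documents could then be passed to an exact text-matching step.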
Keywords/Search Tags: Signature file, Document, Ranking, Search, Term frequency