Font Size: a A A

Robust information retrieval and graphs

Posted on:2005-09-07Degree:Ph.DType:Thesis
University:The Pennsylvania State UniversityCandidate:Park, Seung-TaekFull Text:PDF
GTID:2458390008492269Subject:Computer Science
Abstract/Summary:
This thesis consists of three important topics on information retrieval research: (a) robust information retrieval on the web, (b) characterization and modelization of network graphs, and (c) design and development of an information retrieval tool by using the knowledge and techniques we obtained from our graph analysis study.; First, the thesis proposes new lexical signature (LS) generative methods for managing broken link (invalid hyperlink) problems. A lexical signature consisting of several keywords from a web document is often sufficient information in order to find the document later, even if its URL has changed. We first extend the performance criteria of LSs by replacing the first unique extraction property with a slightly weaker notation. As long as LSs return the target documents in the first rank, the first criterion is satisfied. Then, we add another criterion so that LSs extract highly relevant (or similar) information when the target documents cannot be found. Other performance criteria of LSs, proposed by previous research but not investigated, are search engine independence, robustness against minor document modification, and minimal conflict between existing and new LSs. We conduct a large-scale empirical study on nine methods for generating LSs, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performances on the web over a ten-month period, and on a TREC data set, evaluating their abilities based on the five previously-mentioned performance criteria. DF is good at unique identification but poor at finding relevant documents. PW performs well on the latter data set, but more like DF on the web. TF is easy to compute and maintain, and often performs well, but depends on how the search engine ranks the documents. TFIDF and hybrid methods perform well on both real web data and on the idealized TREC data. We also propose a method called Test and Select (TS) that chooses a LS dynamically. TS attempts to avoid LS conflict at the time a new one is generated. We find that TS dominates all eight static approaches over three different search engines. All LS methods experience significant performance degradation when documents are modified significantly. To ensure reasonable performance, periodic updating of LSs may be required. (Abstract shortened by UMI.)...
Keywords/Search Tags:Information retrieval, Lss, Performance, Documents, Web
Related items