Robust information retrieval and graphs

Posted on:2005-09-07

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:Park, Seung-Taek

Full Text:PDF

GTID:2458390008492269

Subject:Computer Science

Abstract/Summary:

This thesis consists of three important topics on information retrieval research: (a) robust information retrieval on the web, (b) characterization and modelization of network graphs, and (c) design and development of an information retrieval tool by using the knowledge and techniques we obtained from our graph analysis study.; First, the thesis proposes new lexical signature (LS) generative methods for managing broken link (invalid hyperlink) problems. A lexical signature consisting of several keywords from a web document is often sufficient information in order to find the document later, even if its URL has changed. We first extend the performance criteria of LSs by replacing the first unique extraction property with a slightly weaker notation. As long as LSs return the target documents in the first rank, the first criterion is satisfied. Then, we add another criterion so that LSs extract highly relevant (or similar) information when the target documents cannot be found. Other performance criteria of LSs, proposed by previous research but not investigated, are search engine independence, robustness against minor document modification, and minimal conflict between existing and new LSs. We conduct a large-scale empirical study on nine methods for generating LSs, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performances on the web over a ten-month period, and on a TREC data set, evaluating their abilities based on the five previously-mentioned performance criteria. DF is good at unique identification but poor at finding relevant documents. PW performs well on the latter data set, but more like DF on the web. TF is easy to compute and maintain, and often performs well, but depends on how the search engine ranks the documents. TFIDF and hybrid methods perform well on both real web data and on the idealized TREC data. We also propose a method called Test and Select (TS) that chooses a LS dynamically. TS attempts to avoid LS conflict at the time a new one is generated. We find that TS dominates all eight static approaches over three different search engines. All LS methods experience significant performance degradation when documents are modified significantly. To ensure reasonable performance, periodic updating of LSs may be required. (Abstract shortened by UMI.)...

Keywords/Search Tags:

Information retrieval, Lss, Performance, Documents, Web

Related items

1	A study of granular computing for the organization and retrieval of XML documents
2	Research On Interactive Information Retrieval From XML Documents
3	Information retrieval for Khmer documents: Challenges and approaches to word segmentation
4	Research And System Implement On Chinese Documents Retrieval Based On Content
5	The Study On The Relatedness Computation For Structural Documents In The Information Retrieval
6	The Research And Application Of Enterprises Documents Search Engines Based On Lucene
7	Studies Of Enhancing The Performance Of WEB Information Retrieval Algorithms
8	Optimizing ranking functions: A connectionist approach to adaptive information retrieval
9	Query seeded concept decompositions and a new approach to performance quantification in information retrieval
10	Adaptive query modification in a probabilistic information retrieval model