
Advancing information retrieval through databases, fusion and information extraction

Posted on: 2001-01-16
Degree: Ph.D.
Type: Dissertation
University: George Mason University
Candidate: McCabe, M. Catherine
Full Text: PDF
GTID: 1468390014952841
Subject: Information Science
Abstract/Summary:
This dissertation investigates improvements to Information Retrieval (IR) in three areas: novel uses of database technology for information retrieval, fusion of information retrieval strategies in a common environment, and a new kind of relevance feedback, entity-based feedback.

Entity-based feedback is a novel technique for identifying query expansion terms. Entities are identified using a commercial extraction tool that tags Person, Organization and Location entities. These are stored in the inverted index along with the terms and phrases. Since queries are typically short, the first-pass retrieval uses only terms and phrases. Entities from the top-ranked documents are then used for feedback, either alone or along with regular feedback terms and phrases. Experimental results show that entities improved retrieval effectiveness for more queries than not. Further filtering of entities is suggested to eliminate those that do not help.

Combining results from disparate IR systems (fusion) has achieved some success in the past. However, disparate systems differ in many features, so it is unclear what contributes to the improvements. This research investigates the effectiveness of fusion within a common environment using vector space, probabilistic, and weighted Boolean strategies. Experiments covering several thousand combinations were run using 150 queries against a collection of 528,155 documents (two gigabytes total). The results indicate that these strategies return very similar result sets and do not improve with fusion. Varying the query representation, however, yielded improvement. When both retrieval strategy and query representation are varied, even further improvement is gained.

Using databases for information retrieval permits integration of text searches with searches of structured database data. This research verifies that relational algebra and standard Structured Query Language (SQL) support leading probabilistic similarity measures. In addition, this work proposes a schema design using a multidimensional database (MDB) and Online Analytical Processing (OLAP), which permits advanced, interactive analysis of document collections. Items such as publisher, location, organizations, persons, etc. are pulled from the text and stored for searching. Additional structured data such as corporate or public databases may be linked in. Using OLAP tools, the end user interactively explores both text and structured data, moving seamlessly through the documents.
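
The idea that a relational database can rank documents with plain SQL can be illustrated with a minimal sketch. It is not the dissertation's schema or similarity measure: the table names (`doc_term`, `idf`, `query_term`), the toy documents, and the simplified tf-idf dot product are all assumptions chosen for brevity. The sketch only shows that a join plus `GROUP BY` and `SUM` is enough to score documents against a query.

```python
import math
import sqlite3
from collections import Counter

# Toy collection and query, invented purely for illustration.
docs = {
    1: "fusion of retrieval strategies",
    2: "database retrieval with sql",
    3: "olap analysis of document collections",
}
query = "database retrieval"

conn = sqlite3.connect(":memory:")
# Hypothetical schema: an inverted index, per-term weights, and the query terms.
conn.executescript("""
    CREATE TABLE doc_term (doc_id INTEGER, term TEXT, tf INTEGER);
    CREATE TABLE idf (term TEXT PRIMARY KEY, idf REAL);
    CREATE TABLE query_term (term TEXT, tf INTEGER);
""")

# Populate the inverted index with term frequencies per document.
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        conn.execute("INSERT INTO doc_term VALUES (?, ?, ?)", (doc_id, term, tf))

# Inverse document frequency is precomputed in Python, because SQLite's
# log() is only available when the library is built with math functions.
n_docs = len(docs)
doc_freq = Counter(term for text in docs.values() for term in set(text.split()))
conn.executemany(
    "INSERT INTO idf VALUES (?, ?)",
    [(term, math.log(n_docs / df)) for term, df in doc_freq.items()],
)

# Store the query as rows so it can be joined like any other relation.
for term, tf in Counter(query.split()).items():
    conn.execute("INSERT INTO query_term VALUES (?, ?)", (term, tf))

# Score = sum over shared terms of (doc weight * query weight),
# expressed as a join followed by aggregation.
ranking = conn.execute("""
    SELECT d.doc_id, SUM(d.tf * q.tf * i.idf * i.idf) AS score
    FROM doc_term d
    JOIN query_term q ON q.term = d.term
    JOIN idf i ON i.term = d.term
    GROUP BY d.doc_id
    ORDER BY score DESC, d.doc_id
""").fetchall()

for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because the query is itself a table, the same statement could in principle be joined with structured attributes (for example a hypothetical publisher or location table) to combine text ranking with ordinary database predicates, which is the kind of integration the abstract describes.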
Keywords/Search Tags:Information retrieval, Data, Fusion, Using, Structured