Font Size: a A A

Incremental retrieval and ranking of complex patterns from text repositories

Posted on:2008-06-19Degree:M.SType:Thesis
University:The University of Texas at ArlingtonCandidate:Thathireddy, JayakrishnaFull Text:PDF
GTID:2448390005957280Subject:Computer Science
Abstract/Summary:
As the volume of information accessible via the electronic medium (such as the Internet) is staggeringly large and growing rapidly, users have to sift through vast reservoirs of information to retrieve relevant data of their choice. It has been estimated that, the Internet consists of 2.5 billion unique, publicly accessible web-pages and this figure is growing at an alarming rate of 7.3 million pages per day. Currently, the only way to wade through such colossal information is by using search engines (such as Google, Live, Yahoo, etc.). Although the popularity of search engines has increased manifold due to---simplicity of usage, speed of retrieval and amount of results generated, their ability to intelligently retrieve relevant information is significantly hampered due to over-reliance on Boolean operators for data retrieval.;Consider searching for complex patterns involving pattern frequency, proximity, sequence, structural patterns and synonyms. Consider the following examples: (at least 5 occurrences of the phrase "research experiences"), ("metal" near "traders", in any order, within 10 words of each other), ("soya" followed by "plantings", within 5 words of each other), all occurrences of the word ("contract" and its synonyms), and ("France" within the occurrence of "sunflower plantings" and "harvest");The above search patterns are not supported by currently available search engines. Researchers looking for information in domains such as security, biology, legal research etc. need focused, objective and precise semantics to specify such patterns. In order to deal with such complex requirements of specific domain users, the design of a document retrieval system---that consists of a pattern specification language and an efficient detection engine that allows specification of such expressive patterns---is needed.;To address these issues, we have designed a framework (made up of two interdependent yet distinct systems---InfoFilter and InfoSearch) based on an expressive pattern specification language and a set of novel detection algorithms that handle streaming as well as static data. InfoFilter handles pattern detection for streaming data (news feeds, IP packets, etc.) where freshness of search results is paramount. However, in the case of data that resides in the form of large yet static repositories, and when the freshness of data is not critical, the InfoSearch system handles pattern detection using a pre-computed index.;The initial design of InfoSearch, for complex pattern detection, focused on fetching all matching occurrences of the pattern in the data repositories. It further processed all the tuples of the operands that constituted the pattern. In this approach, all answers are generated even if the user is interested in a small number of answers. Generating all answers when the request is only for "k" answers is inefficient in terms of processing time, memory utilization and number of computations. Moreover, the results are generated in the order in which they are detected, thus ignoring the relevancy of results with respect to user preferences. The user may be interested in ranked results which can indicate their quality. In order to address these problems, an incremental approach for complex pattern detection is needed. Moreover, in order to generate results based on the relevancy rather than the order of detection, ranking mechanisms for appropriately filtering the results also needs to be addressed.;In this thesis, we investigate several approaches for incremental detection of complex patterns, retrieval, and ranking of results. We also investigate the need for novel data structures as well as the types of structural meta-data to be associated with the data stored in the index for ranking the fetched results. We propose a novel ranking algorithm that utilizes the structural boundaries of the data to rank results based on the location and occurrence of a complex pattern in a document. We also present algorithms for each operator encountered in a pattern, that are based on ensuring optimal utilization of computational and memory resources in the least possible time. Extensive experiments have been performed to evaluate the scalability, performance, and memory usage of these algorithms on a number of patterns.
Keywords/Search Tags:Pattern, Retrieval, Ranking, Results, Information, Data, Incremental
Related items