Integrating text search and relational databases: Functionality and performance

Posted on:2007-12-18

Degree:Ph.D

Type:Thesis

University:The University of Wisconsin - Madison

Candidate:Ercegovac, Vuk

GTID:2458390005487618

Subject:Computer Science

Abstract/Summary:

Applications increasingly involve a mix of free-text documents and traditional relational tables [46]. Commercial relational database management systems (RDBMSs) store both types of data and support access through keyword search, traditional relational operators in SQL, or mixed queries that combine both. However, application developers lack the tools for addressing functionality and performance concerns that are available for traditional scalar data and that are also needed when integrating keyword search into an RDBMS. With regard to functionality, this thesis proposes TextViews as a fully declarative way to specify virtual collections of virtual documents for use with keyword search. For performance, this thesis proposes TEXTURE, a benchmark for comparing RDBMSs given a workload of mixed queries.

Current RDBMSs store a document as a single attribute value and a single collection in a table. TextViews are an adaptation of relational views for defining documents that are composed of multiple documents, possibly stored in multiple tables. Such documents are grouped into a collection and ranked using keyword search. Keyword search can be evaluated either by materializing the TextView and then searching it, or by using inverted indexes built on the base table. Inverted indexes do not take advantage of the scalar attributes used in the selection and grouping operations that are specified in TextView definitions. Consequently, we propose several alternative indexes, for which we demonstrate an order-of-magnitude improvement in response time for keyword search with a modest increase in storage compared to inverted indexes.

The TEXTURE benchmark [28] compares RDBMSs by measuring the response time needed to evaluate a workload of mixed queries. A micro-benchmark design is used to allow fine-grained control over the query workload and data set. To support database scale-up experiments, TextGen, a novel synthetic text generator, was developed and evaluated. TextGen is unique in that it can accurately scale up an input "seed" text collection while preserving important data characteristics. The TEXTURE benchmark was used to evaluate three commercial RDBMSs, demonstrating large differences between them for a variety of workloads.
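
The materialize-then-search strategy described in the abstract can be pictured with a minimal Python sketch. This is not the thesis's implementation: the `REVIEWS` table, the grouping on `product_id`, the function names, and the term-count score are all hypothetical choices made for illustration, since the abstract gives neither the TextView definition syntax nor the ranking function.

```python
from collections import Counter, defaultdict

# Hypothetical base table: (product_id, category, review_text).
REVIEWS = [
    (1, "camera", "great lens and sharp pictures"),
    (1, "camera", "battery drains fast"),
    (2, "camera", "poor lens but long battery life"),
    (3, "phone", "sharp screen and great battery"),
]


def materialize_view(rows, category):
    """Select rows on a scalar attribute, then group their text by
    product_id so that each group becomes one virtual document."""
    docs = defaultdict(list)
    for product_id, cat, text in rows:
        if cat == category:
            docs[product_id].append(text)
    return {pid: " ".join(texts) for pid, texts in docs.items()}


def keyword_search(docs, query):
    """Rank virtual documents by how often the query terms occur.
    The score is a stand-in assumed for illustration only."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


camera_docs = materialize_view(REVIEWS, "camera")
results = keyword_search(camera_docs, "lens battery")
```

In this sketch the selection on `category` and the grouping on `product_id` run before any text is searched. An inverted index built only on the review text of the base table would have no knowledge of either attribute, which is the limitation the abstract attributes to inverted indexes.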

Keywords/Search Tags:

Text, Relational, Data, Search, TEXTURE, Documents, Functionality, RDBMSs

Related items

1	Mapping and storing XML documents to a relational database
2	Accessing relational databases using virtual XML documents
3	Algorithms for generating XML documents from hierarchical views of relational databases
4	Educational Information In A Relational Database Full-text Search Efficiency Improvements And Implementation
5	Design And Implementation Of Multi-source Document Of Full-Text Search System
6	Research On Transformation From XML Documents To OWL Documents
7	Enhancements for the Search Functionality of an Open Source Email Client
8	Text Search Techniques And Optimization Strategies On Hybrid Data
9	How To Embody The Functionality Of The Special Interest APP In A Big Data World
10	Query Suggestion Techniques For Keyword Search