
Integrating text search and relational databases: Functionality and performance

Posted on: 2007-12-18
Degree: Ph.D
Type: Thesis
University: The University of Wisconsin - Madison
Candidate: Ercegovac, Vuk
Full Text: PDF
GTID: 2458390005487618
Subject: Computer Science
Abstract/Summary:
Applications increasingly involve a mix of free-text documents and traditional relational tables [46]. Commercial relational database management systems (RDBMSs) store both types of data and support access through keyword search, traditional relational operators in SQL, or mixed queries that combine both. However, application developers lack the tools for addressing functionality and performance concerns that are available for traditional, scalar data and that are needed when integrating keyword search into an RDBMS. With regard to functionality, this thesis proposes TextViews as a fully declarative way to specify virtual collections of virtual documents for use with keyword search. For performance, this thesis proposes TEXTURE, a benchmark for comparing RDBMSs given a workload of mixed queries.

Current RDBMSs store a document as a single attribute value and a single collection in a table. TextViews are an adaptation of relational views for defining documents that are composed of multiple documents, possibly stored in multiple tables. Such documents are grouped into a collection and ranked using keyword search. Keyword search can be evaluated either by materializing the TextView and then searching it, or by using inverted indexes built on the base table. Inverted indexes do not take advantage of the scalar attributes used in the selection and grouping operations that are specified in TextView definitions. Consequently, we propose several alternative indexes, for which we demonstrate an order-of-magnitude improvement in response time for keyword search with only a modest increase in storage compared to inverted indexes.

The TEXTURE benchmark [28] compares RDBMSs by measuring the response time needed to evaluate a workload of mixed queries. A micro-benchmark design is used to allow fine-grained control over the query workload and data set. To support database scale-up experiments, TextGen, a novel synthetic text generator, was developed and evaluated. TextGen is unique in that it is capable of accurately scaling up an input "seed" text collection while preserving important data characteristics. The TEXTURE benchmark was used to evaluate three commercial RDBMSs, demonstrating large differences between them for a variety of workloads.
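As a rough illustration of the materialize-then-search strategy mentioned in the abstract, the Python sketch below builds virtual documents by grouping text from two in-memory tables and then ranks them for a keyword query. It is not the thesis's implementation: the table contents, the `materialize_view` and `keyword_search` names, and the simple term-frequency score are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical base tables: each row holds a text attribute plus scalar keys.
papers = [
    {"paper_id": 1, "author": "smith", "abstract": "keyword search in relational databases"},
    {"paper_id": 2, "author": "jones", "abstract": "synthetic text generation for benchmarks"},
    {"paper_id": 3, "author": "smith", "abstract": "inverted indexes for keyword search"},
]
reviews = [
    {"paper_id": 1, "comment": "solid treatment of ranking"},
    {"paper_id": 3, "comment": "search performance is well evaluated"},
]


def materialize_view(group_key):
    """Build one virtual document per group from text stored in both tables."""
    parts = defaultdict(list)
    group_of_paper = {}
    for p in papers:
        parts[p[group_key]].append(p["abstract"])
        group_of_paper[p["paper_id"]] = p[group_key]
    for r in reviews:
        # Join each review to its paper, then add it to that paper's group.
        parts[group_of_paper[r["paper_id"]]].append(r["comment"])
    return {key: " ".join(texts) for key, texts in parts.items()}


def keyword_search(view, query):
    """Rank virtual documents by summed term frequency (an assumed score)."""
    terms = query.lower().split()
    scored = []
    for key, text in view.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((key, score))
    # Highest score first; ties broken by key for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


view = materialize_view("author")
results = keyword_search(view, "keyword search")
```

Here the grouping attribute (`author`) plays the role of a scalar attribute in a view definition. Every query pays the cost of building the whole view before searching it, which is the kind of overhead that the alternative indexes described above aim to avoid.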
Keywords/Search Tags: Text, Relational, Data, Search, TEXTURE, Documents, Functionality, RDBMSs