
SEEDEEP: A system for exploring and querying deep web data sources

Posted on: 2011-02-16
Degree: Ph.D
Type: Dissertation
University: The Ohio State University
Candidate: Wang, Fan
Full Text: PDF
GTID: 1468390011471434
Subject: Computer Science
Abstract/Summary:
A popular trend in data dissemination involves online data sources that are hidden behind query forms, forming what is referred to as the deep web. Unlike the surface web, where HTML pages are static and data is stored as document files, deep web data is stored in backend databases, and dynamic HTML pages are generated only after a user submits a query by filling in an online form. Currently, hundreds of large, complex and, in many cases, related and/or overlapping deep web data sources are available, and their number is still increasing rapidly every year.

The emergence of the deep web poses many new challenges for data integration and query answering. First, unlike relational databases, where users have direct access to the data tables, the metadata of deep web databases (the database schemas) and the complete set of data tuples stored in them are hidden from the data integration system. Second, most deep web data sources are created and maintained independently, so it is not uncommon for multiple data sources to have data redundancy and overlap. Furthermore, similar data sources may provide data of different quality, and even conflicting data; data source selection is therefore of great importance for a data integration system. Third, deep web data sources in a domain often have inter-dependencies, i.e., the output from one data source may be the input of another. As a result, answering a query over a set of deep web data sources often involves accessing a sequence of inter-dependent data sources in an intelligent order. Fourth, the common way of accessing data in deep web data sources is through standardized input interfaces. On the one hand, these interfaces provide a very simple query mechanism; on the other hand, they significantly constrain the types of queries that can be executed automatically. Finally, all deep web data sources are network based, and querying them involves the use of various communication links. Both the data source servers and the network links are vulnerable to congestion and failures, so handling fault tolerance is also necessary for a data integration system.

In our work, we propose SEEDEEP, an automatic System for Exploring and quErying DEEP web data sources. SEEDEEP integrates the deep web data sources in a particular domain and provides search functionality over structured SQL queries, online aggregation queries, and low selectivity queries for domain users. Currently, the system is composed of five modules: schema mining, query planning, approximate query answering, query reuse, and fault tolerance. The schema mining module semi-automatically mines the metadata of deep web data sources and builds data models for the other modules in the system. The query planning module and the approximate query answering module are the core of SEEDEEP. The query planning module takes a structured query as input and, based on a cost model, generates a query plan over the set of integrated deep web data sources to answer it. Currently, the query planning module handles Selection-Projection-Join queries, aggregation queries, and nested queries.
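To illustrate the inter-dependency issue that a query plan must respect, the following is a minimal sketch, not SEEDEEP's actual implementation: the source names, attributes, and fetch functions are hypothetical, and it only shows how a plan can be ordered and executed so that each source is queried after its required inputs are available, without the cost-based planning described above.

    # A minimal, hypothetical sketch of a query plan over inter-dependent
    # deep web sources: order the sources so that every source's required
    # input attributes are produced before it is queried, then execute.
    from typing import Callable, Dict, List, Set


    class Source:
        """A deep web source: needs some input attributes, returns some outputs."""
        def __init__(self, name: str, inputs: Set[str], outputs: Set[str],
                     fetch: Callable[[Dict[str, str]], Dict[str, str]]):
            self.name, self.inputs, self.outputs, self.fetch = name, inputs, outputs, fetch


    def order_sources(sources: List[Source], bound: Set[str]) -> List[Source]:
        """Greedily pick sources whose inputs are already bound,
        i.e., a topological order over the input/output dependencies."""
        plan, available, remaining = [], set(bound), list(sources)
        while remaining:
            ready = next((s for s in remaining if s.inputs <= available), None)
            if ready is None:
                raise ValueError("No executable order: unsatisfied input dependencies")
            plan.append(ready)
            available |= ready.outputs
            remaining.remove(ready)
        return plan


    def execute(plan: List[Source], bindings: Dict[str, str]) -> Dict[str, str]:
        """Run the plan in order, feeding each source's output to later sources."""
        for source in plan:
            args = {a: bindings[a] for a in source.inputs}
            bindings.update(source.fetch(args))   # stands in for a form submission
        return bindings


    # Hypothetical usage: gene -> protein -> pathway, starting from a gene name.
    gene_db = Source("GeneDB", {"gene"}, {"protein"},
                     lambda a: {"protein": f"protein_of_{a['gene']}"})
    path_db = Source("PathwayDB", {"protein"}, {"pathway"},
                     lambda a: {"pathway": f"pathway_of_{a['protein']}"})
    plan = order_sources([path_db, gene_db], bound={"gene"})
    print([s.name for s in plan])                 # ['GeneDB', 'PathwayDB']
    print(execute(plan, {"gene": "TP53"}))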
For certain queries, it is hard to obtain the exact answer from the deep web within a reasonable period of time, due to data access constraints specified on the data sources' input interfaces or by the data sources' designers. Such queries are handled by the approximate query answering module, which finds approximate answers for online aggregation and low selectivity queries using sampling in an effective and efficient manner (a simplified sampling sketch follows this summary). The query reuse module exploits the similarity between queries and accelerates the execution of a query by reusing previous query plans and cached query data. Finally, the fault tolerance module deals with data source unavailability and inaccessibility: if some data sources become unavailable or inaccessible during the execution of a query plan generated by the query planning module, the fault tolerance module re-generates a partial query plan to replace the part of the original plan that has become inaccessible.

As part of our future work, we will extend our research in two directions. First, we will look into the problem of understanding deep web data in terms of data quality and data distribution. Knowledge of data quality and data distribution is of great importance to many querying problems, yet this information cannot be easily obtained from the deep web because of its limited accessibility; we therefore need to propose efficient and effective approximation methods to address this problem. Second, we will look into the query planning problem for more advanced queries, such as correlated nested sub-queries, existence queries ("Is-there" style queries), data mining queries, and semantic queries.
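The sampling idea behind the approximate query answering module can be made concrete with a short sketch. This is an assumption-laden illustration, not the dissertation's estimator: the fetch_random_tuple probe, the attribute name, and the batch sizes are all hypothetical, and it simply refines a running average with a confidence interval as more samples arrive.

    # A minimal, hypothetical sketch of sampling-based online aggregation:
    # estimate the average of an attribute by drawing random samples from a
    # source and reporting a running 95% confidence interval.
    import math
    import random


    def online_average(fetch_random_tuple, attribute, batches=20, batch_size=25):
        """fetch_random_tuple() stands in for a (hypothetical) random probe of a
        deep web source; it returns one tuple as a dict."""
        values = []
        for _ in range(batches):
            for _ in range(batch_size):
                values.append(fetch_random_tuple()[attribute])
            n = len(values)
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / (n - 1)
            half_width = 1.96 * math.sqrt(var / n)   # approximate 95% interval
            yield mean, half_width                   # answer refines over time


    # Hypothetical usage against a simulated source.
    random.seed(0)
    simulated_source = lambda: {"price": random.gauss(100.0, 15.0)}
    for estimate, err in online_average(simulated_source, "price"):
        print(f"estimated average = {estimate:.2f} +/- {err:.2f}")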
Keywords/Search Tags: Deep web data sources, SEEDEEP, Queries, System, Approximate query answering module, Query planning module, HTML pages, Fault tolerance