
Information extraction and integration for Web databases

Posted on: 2005-09-15
Degree: Ph.D
Type: Thesis
University: Hong Kong University of Science and Technology (People's Republic of China)
Candidate: Wang, Jiying
Full Text: PDF
GTID: 2458390008493432
Subject: Computer Science
Abstract/Summary:
A large number of the Web pages returned by filling in search forms cannot be indexed by most search engines today, since they are dynamically generated by querying a back-end (relational or object-relational) database. Web sites of this kind, referred to as Web databases, usually present complex data objects with nested structures in their Web pages. In this thesis, we address a variety of problems related to retrieving information from Web databases. To extract the structured data embedded in template-generated pages from Web databases, we first develop an algorithm that automatically identifies the data-rich sections of a page, and then propose an innovative approach that automatically induces regular-expression wrappers from those sections. To understand the semantics of both the query interfaces and the data extracted from various Web databases, and to integrate them, we propose a combined schema model that describes the different schemas of a Web database (global, interface and result schemas). We then address two significant schema-matching problems for Web databases, intra-site schema matching and inter-site schema matching, and investigate an instance-based method that uses domain-specific query probing to solve both problems at the same time.
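As a rough illustration of what a regular-expression wrapper does once it exists, the sketch below applies a hand-written pattern to a small template-generated result page. It is not the induction algorithm from the thesis: the page content, the `RECORD` pattern and the `extract_records` helper are all hypothetical, and here the wrapper is written by hand rather than induced automatically.

```python
import re

# Hypothetical result page: every record is rendered by the same template.
PAGE = """
<ul>
<li><b>Database Systems</b> <i>Ullman</i> <span>$55.00</span></li>
<li><b>Web Data Mining</b> <i>Liu</i> <span>$72.50</span></li>
</ul>
"""

# Hand-written wrapper (an assumption for illustration). The fixed tags are
# the template; the named groups capture the data fields that vary.
RECORD = re.compile(
    r"<li><b>(?P<title>.*?)</b> "
    r"<i>(?P<author>.*?)</i> "
    r"<span>\$(?P<price>[\d.]+)</span></li>"
)


def extract_records(page):
    """Return one dict per record matched by the wrapper."""
    return [match.groupdict() for match in RECORD.finditer(page)]


records = extract_records(PAGE)
for record in records:
    print(record)
```

The named groups play the role of the result schema's attributes, which is why a wrapper of this shape yields structured records rather than raw text.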
Keywords/Search Tags: Web, Schema