Data extraction and integration of semistructured documents

Posted on:2002-02-07

Degree:Ph.D.

Type:Dissertation

University:University of California, Davis

Candidate:Chung, Yip

GTID:1468390014950331

Subject:Computer Science

Abstract/Summary:

Before the vision of the Semantic Web, in which data is shared in a meaningful and effective way, is realized, we have to deal with large volumes of legacy HTML documents. Information in these documents is buried in the text because HTML is designed for visual rendering, not for describing data. State-of-the-art information retrieval techniques rely on keyword-based search engines. They do not support structured queries on the documents. A user may [...] to facilitate visual browsing and data management. Existing approaches do not support automated integration of heterogeneous documents.

This dissertation aims to address these issues and make the information buried in HTML documents accessible to users and applications. Transforming the whole Web into a structured collection of documents is intractable. Thus, we focus on topic-specific HTML documents: documents pertaining to a specific topic, authored by different people from diverse data sources.

We present Quixote, a tool that integrates topic-specific HTML documents into XML documents conforming to a global schema. It consists of three components:

(1) Document Converter. It extracts information from HTML documents and encodes it in XML documents. Extraction is automatic and driven by rules that are insensitive to changes in data formats and applicable to diverse data sources. It does not assume that the documents follow a known format; it assumes only that the records within a document follow some regular format.

(2) Schema Miner. We propose a new type of approximate schema, called a majority schema, that describes only the prevalent structures in a collection of XML documents. The Schema Miner infers a majority schema from the documents, which [...]

(3) Document Transformer. It automatically integrates XML documents based on the majority schema discovered. It adapts techniques from schema integration approaches for relational data to XML data. It addresses the unique challenge of preserving the semantics of the documents during integration, since a majority schema does not cover all structures in the documents.
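
To make the majority schema idea concrete, here is a minimal sketch, not the dissertation's algorithm. It assumes that "prevalent" can be approximated by counting how many documents contain each root-to-element tag path and keeping the paths that appear in more than a chosen fraction of them. The function names, the `threshold` parameter, and the course documents are all hypothetical and exist only for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the set of root-to-element tag paths that occur in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def majority_paths(documents, threshold=0.5):
    """Keep paths present in more than `threshold` of the documents.

    The frequency criterion is an assumption made for this sketch; the
    abstract does not state how prevalence is actually measured.
    """
    counts = Counter()
    for doc in documents:
        # Each path counts at most once per document.
        counts.update(element_paths(doc))
    cutoff = threshold * len(documents)
    return sorted(p for p, n in counts.items() if n > cutoff)


# Hypothetical topic-specific documents (course listings from three sources).
docs = [
    "<course><title>Databases</title><instructor>Lee</instructor></course>",
    "<course><title>Compilers</title><instructor>Kim</instructor><room>12</room></course>",
    "<course><title>Networks</title></course>",
]
print(majority_paths(docs))
```

In this toy collection the `room` element appears in only one of three documents, so it falls outside the approximate schema. That is the situation the abstract points to when it says a majority schema does not cover all structures, which is why an integration step has to take care not to lose such data.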

Keywords/Search Tags:

Documents, Data, Integration, Schema

Related items

1	Research On Data Integration And Exchange Technology Of The Agile Virtual Enterprise
2	Research On Global Schema Construction In Web Data Integration
3	Research On Key Technology Of XML-based Data Integration Platform
4	Research On Heterogeneous Data Resource Integration System Based On XML And Used In Civil Aircraft Area
5	Research On Capturing Both Types And Constraints In Data Integration
6	Research On Key Technologies Of Equipment Support Heterogeneous Data Integration And Design Of Integration Environment
7	Research On Publishing XML Documents From Enterprise Database
8	A Study On Heterogeneous Data Integration Based On Xml Schema
9	A Study On Heterogeneous Data Integration Based On XML Schema
10	Research On Schema Matching Technology Supporting Massive Heterogeneous Data Integration