
Data extraction and integration of semistructured documents

Posted on: 2002-02-07
Degree: Ph.D.
Type: Dissertation
University: University of California, Davis
Candidate: Chung, Yip
Full Text: PDF
GTID: 1468390014950331
Subject: Computer Science
Abstract/Summary:
Before the vision of the Semantic Web, in which data is shared in a meaningful and effective way, is realized, we have to deal with large volumes of legacy HTML documents. Information in these documents is buried in the text because HTML is designed for visual rendering, not for describing data. State-of-the-art information retrieval techniques rely on keyword-based search engines, which do not support structured queries on the documents. A user may [...] to facilitate visual browsing and data management. Existing approaches do not support automated integration of heterogeneous documents.

This dissertation aims to address these issues and make the information buried in HTML documents accessible to users and applications. Transforming the whole Web into a structured collection of documents is intractable. Thus, we focus on topic-specific HTML documents: documents pertaining to a specific topic, authored by different people from diverse data sources.

We present Quixote, a tool that integrates topic-specific HTML documents into XML documents conforming to a global schema. It consists of three components:

(1) Document Converter. It extracts information from HTML documents and encodes that information in XML documents. Extraction is automatic and uses rules that are insensitive to changes in data formats and applicable to diverse data sources. It does not assume that the documents follow a known format; it assumes only that the records within a document follow some regular format.

(2) Schema Miner. We propose a new type of approximate schema, called a majority schema, that describes only the prevalent structures in a collection of XML documents. The Schema Miner infers a majority schema from the documents, which [...]

(3) Document Transformer. It automatically integrates XML documents based on the discovered majority schema. It adapts techniques from schema integration approaches for relational data to XML data. It addresses the unique challenge of preserving the semantics of the documents during integration, since a majority schema does not cover all structures in the documents.
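
To make the idea of a schema that describes only prevalent structures more concrete, here is a minimal Python sketch. It is not the dissertation's Schema Miner algorithm, which the abstract does not specify. It assumes, purely for illustration, that a structure is "prevalent" when its root-to-element tag path occurs in more than half of the documents. The function names `element_paths` and `majority_paths`, the `threshold` parameter, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def majority_paths(xml_docs, threshold=0.5):
    """Keep the paths found in more than `threshold` of the documents.

    Assumption for illustration only: prevalence is measured per document,
    and the cutoff is a simple fraction of the collection size.
    """
    counts = Counter()
    for doc in xml_docs:
        counts.update(element_paths(doc))
    cutoff = threshold * len(xml_docs)
    return sorted(p for p, n in counts.items() if n > cutoff)


# Hypothetical sample collection: three person records with varying fields.
docs = [
    "<person><name>A</name><email>a@x.org</email></person>",
    "<person><name>B</name><phone>555</phone></person>",
    "<person><name>C</name><email>c@x.org</email><fax>1</fax></person>",
]
print(majority_paths(docs))
# ['/person', '/person/email', '/person/name']
```

In this toy collection, `phone` and `fax` each appear in only one of three documents, so they fall outside the approximate schema. This is the situation the abstract points to: structures not covered by the majority schema still carry meaning that has to be preserved during integration.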
Keywords/Search Tags: Documents, Data, Integration, Schema