Mining knowledge for instance integration in heterogeneous databases

Posted on: 1998-03-13

Degree: Ph.D.

Type: Dissertation

University: University of Minnesota

Candidate: Ganesh, M

Full Text: PDF

GTID: 1468390014478279

Subject: Computer Science

Abstract/Summary:

Data integration from multiple heterogeneous data sources has become a high-priority task in many large enterprises, to achieve competitive advantage and effective utilization of corporate resources. Success of the integration process is critically dependent on the availability of accurate semantic information on the data contents. Techniques for retrieving this information directly from the data are being used to complement human knowledge and to automate some of the data integration tasks. This research investigates the application of data-mining techniques to retrieve knowledge for instance integration in heterogeneous database systems.

Identifying and integrating all the instances of data items that represent the same real-world entity is an important task, distinct from schema integration. Entity identification (EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration task. When common key attributes are not available across different data sources, the rules for EI and the rules for AVCR are expressed as combinations of constraints on their attribute values. We have developed a method which allows users to provide examples of similar data items instead of specifying the instance-integration rules directly. A learning module is then used to extract comprehensive and precise rules from the examples, employing knowledge-discovery techniques.

Well-known classification and clustering techniques are designed for applications which deal with a much smaller number of distinct items than is usually present in database environments. We use distance functions measuring similarity between attribute values to transform the instance-integration problem into a binary classification problem. A library of distance functions for commonly occurring attribute types has been developed.

Experimental evaluation on real-world business databases shows that this method can achieve complete accuracy in learning simple EI rules and can classify more than 85% of records accurately even with relatively complex rules. We have also developed an algorithm to compute record distances as a function of attribute distances, thereby making it possible to use heuristic clustering algorithms for EI. Such algorithms greatly improve the efficiency of the process with a minor trade-off in effectiveness.
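
The idea of turning entity identification into binary classification over attribute distances can be illustrated with a minimal sketch. It is not the dissertation's implementation: the attribute names, the two distance functions, the equal weights, and the single-threshold "learner" are all assumptions chosen for illustration, and the threshold rule merely stands in for the learning module, which the abstract does not describe in detail.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Case-insensitive string distance in [0, 1]; 0.0 means identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_distance(a, b):
    """Relative difference; lies in [0, 1] for non-negative values."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


# Hypothetical "library" mapping each attribute to a distance function.
DISTANCES = {
    "name": string_distance,
    "city": string_distance,
    "revenue": numeric_distance,
}
WEIGHTS = [1.0, 1.0, 1.0]  # assumed equal weights, one per attribute


def distance_vector(r1, r2):
    """One distance per attribute: the feature vector for a record pair."""
    return [fn(r1[attr], r2[attr]) for attr, fn in DISTANCES.items()]


def record_distance(r1, r2, weights):
    """Record distance as a weighted mean of the attribute distances."""
    vec = distance_vector(r1, r2)
    return sum(w * d for w, d in zip(weights, vec)) / sum(weights)


def learn_threshold(pairs, labels, weights):
    """Midpoint between the farthest labeled match and nearest non-match."""
    dists = [record_distance(x, y, weights) for x, y in pairs]
    matches = [d for d, same in zip(dists, labels) if same]
    others = [d for d, same in zip(dists, labels) if not same]
    return (max(matches) + min(others)) / 2


def is_match(r1, r2, weights, threshold):
    """Binary decision: do the two records denote the same entity?"""
    return record_distance(r1, r2, weights) <= threshold


# User-supplied examples of similar and dissimilar records (made-up data).
a = {"name": "Acme Corp", "city": "Minneapolis", "revenue": 100.0}
b = {"name": "ACME Corp.", "city": "Minneapolis", "revenue": 100.0}
c = {"name": "Zenith Ltd", "city": "Duluth", "revenue": 5.0}
d = {"name": "Acme Corporation", "city": "Minneapolis", "revenue": 102.0}

threshold = learn_threshold([(a, b), (a, c)], [True, False], WEIGHTS)
print(is_match(a, d, WEIGHTS, threshold))
```

The same `record_distance` function could also serve as the metric for a clustering algorithm, which is the role the abstract assigns to record distances; a real system would presumably learn per-attribute constraints rather than a single global threshold.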

Keywords/Search Tags:

Data, Integration, Heterogeneous

Related items

1	Heterogeneous Systems, Data Integration Platform For Research And Implementation For The Publishing Industry
2	Research And Application On XML-based Heterogeneous Data Sources Integration System
3	Research On Key Technology Of Medical Video Data Store Center And HIS Heterogeneous Data Sources Integration
4	Research And Application Of Integration Method For Multi-source Heterogeneous Data
5	Design And Implementation Of Enterprise Heterogeneous Data Integration&Query System Based On XML
6	Implementation Of The Distributed Intelligent Heterogeneous Data Integration System Prototype
7	Heterogeneous Data Integration Technology
8	The Design And Implementation Of Heterogeneous Data Integration Software System
9	Data Storage In The CORBA Based Heterogeneous Data Integration System
10	Research On Mapping Of Heterogeneous Data Integration