
Research And Implementation Of The Data Quality Control Methods In Integrating Heterogeneous Data Sources

Posted on: 2016-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Jiang
Full Text: PDF
GTID: 2308330464953299
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, a tremendous volume of data in various forms is stored in different data sets. How to effectively link and unify these data has become an increasingly important issue in the big data era. The process of merging and unifying such heterogeneous data sets is known as multi-schema data integration. Many websites, such as eTao.com and qunar.com, use data integration as one of the core techniques in their online business systems. However, data quality problems, such as inconsistencies and missing values, make these data hard to integrate and, if not handled well, lead to more serious data quality problems in the integrated result. In this thesis, we provide quality control techniques and methods for multi-schema data integration. In addition, we design and implement an end-to-end data integration system called SmartInt, which implements these techniques.

Specifically, our work covers the following aspects:

(1) Based on our study and analysis of related work on schema mapping and record matching, we implement a suite of state-of-the-art methods for both tasks and design a decision-tree-based method for matching records on multiple attributes. According to our experiments, the proposed method greatly improves efficiency and also increases accuracy to a certain degree.

(2) We propose a novel interactive data integration method, which performs schema mapping and record matching alternately to reach better integration precision and recall. In addition, this method is robust enough to work with data sets that have various data quality problems, such as typos and missing values.

(3) We implement an end-to-end prototype system called SmartInt, centered on the interactive integration algorithm, to demonstrate the interactive process of schema mapping and record matching as well as the integrated results after each iteration. Furthermore, we apply and extend the system to scenarios that require integrating multi-source housing data, to further illustrate its significance and practical value.

In this thesis, we study the problem of data quality in integrating heterogeneous data sources and implement a prototype system that handles the problem well. The proposed techniques make contributions in both research and practice.
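
The alternation described in point (2) can be illustrated with a minimal sketch. It is not the algorithm from the thesis, whose details the abstract does not give. It assumes that string similarity from Python's standard `difflib` is an acceptable stand-in for the real matchers, and every name in it (`sim`, `map_schema`, `match_records`, `integrate`, the sample housing rows and the 0.8 threshold) is hypothetical. Schema mapping is re-estimated from the currently matched record pairs, record matching is redone under the new mapping, and the loop stops once neither changes.

```python
from difflib import SequenceMatcher

def sim(a, b):
    """String similarity in [0, 1]; None marks a missing value."""
    if a is None or b is None:
        return None
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def mean(values):
    """Average that ignores missing (None) similarities."""
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else 0.0

def map_schema(left, right, pairs):
    """Map each left column to the right column whose values agree best
    over the given record pairs."""
    mapping = {}
    for lc in left[0]:
        scores = {
            rc: mean(sim(left[i][lc], right[j][rc]) for i, j in pairs)
            for rc in right[0]
        }
        mapping[lc] = max(scores, key=scores.get)
    return mapping

def match_records(left, right, mapping, threshold=0.8):
    """Pair each left record with its most similar right record,
    comparing only the columns linked by the current mapping."""
    pairs = []
    for i, lrow in enumerate(left):
        scored = [
            (mean(sim(lrow[lc], rrow[rc]) for lc, rc in mapping.items()), j)
            for j, rrow in enumerate(right)
        ]
        score, j = max(scored)
        if score >= threshold:
            pairs.append((i, j))
    return pairs

def integrate(left, right, max_rounds=5):
    """Alternate schema mapping and record matching until both are stable."""
    # No matches are known at the start, so every record pair is a candidate.
    pairs = [(i, j) for i in range(len(left)) for j in range(len(right))]
    mapping = None
    for _ in range(max_rounds):
        new_mapping = map_schema(left, right, pairs)
        new_pairs = match_records(left, right, new_mapping)
        if new_mapping == mapping and new_pairs == pairs:
            break
        mapping, pairs = new_mapping, new_pairs
    return mapping, pairs

# Hypothetical housing listings from two sources with different schemas,
# including typos and one missing value.
left = [
    {"name": "Sunny Garden", "district": "Gusu", "price": "12000"},
    {"name": "Lake View Court", "district": "Wuzhong", "price": "15500"},
    {"name": "Maple Residence", "district": "Xiangcheng", "price": "9800"},
]
right = [
    {"community": "Lake View Cort", "area": "Wuzhong", "unit_price": "15500"},
    {"community": "Maple Residence", "area": None, "unit_price": "9800"},
    {"community": "Suny Garden", "area": "Gusu", "unit_price": "12000"},
]

mapping, pairs = integrate(left, right)
print(mapping)
print(pairs)
```

In this sketch, missing values are simply left out of the averages, so a record with an empty field can still be matched on its remaining attributes, and small typos only lower the similarity score rather than blocking a match.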
Keywords/Search Tags: Data Fusion, Schema Mapping, Record Matching