Data cleaning techniques by means of entity resolution

Posted on:2008-12-17

Degree:Ph.D

Type:Thesis

University:The Pennsylvania State University

Candidate:On, Byung-Won

GTID:2448390005956325

Subject:Computer Science

Abstract/Summary:

Real data are "dirty." Despite active research on integrity constraint enforcement and data cleaning, data in real database applications remain dirty. To make matters worse, the diverse formats and usages of modern data, together with the demand for large-scale data handling, make the problem even harder. To address the cases in which conventional solutions no longer work, we focus on one type of problem known as Entity Resolution (ER): the process of identifying and merging duplicate entities determined to represent the same real-world object. Although the problem has been studied extensively, de-duplicating complex entities among a large number of candidates is still not trivial.

In this thesis, we study three specialized types of ER problems: (1) the Split Entity Resolution (SER) problem, in which instances of the same entity type mistakenly appear under different name variants; (2) the Mixed Entity Resolution (MER) problem, in which instances of different entities appear together because they share homonymous names; and (3) the Grouped Entity Resolution (GER) problem, in which instances of entities carry no name or description to which ER techniques can be applied, so the contents of each entity are exploited as a group of elements. For each type of problem, we have developed a novel scalable solution. In particular, for the GER problem, we have developed two graph-theoretic algorithms, one based on Quasi-Clique and the other based on Bipartite Matching, and have experimentally validated the superiority of the proposed solutions.
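The abstract does not spell out how the Bipartite Matching approach to GER works, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes each entity is a group of text elements, treats two elements as compatible when their token-level Jaccard similarity reaches a threshold, and scores two groups by the size of a maximum bipartite matching over the compatible pairs. The function names, the threshold `theta`, and the normalization are all hypothetical choices made for illustration.

```python
from typing import List, Set


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (assumed element measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def max_matching_size(left: List[str], right: List[str], theta: float) -> int:
    """Size of a maximum bipartite matching, found with augmenting paths."""
    # Edge (i, j) exists when the two elements are similar enough.
    adj = [
        [j for j, r in enumerate(right) if token_jaccard(l, r) >= theta]
        for l in left
    ]
    match_right = [-1] * len(right)  # match_right[j] = index of matched left element

    def try_augment(i: int, seen: Set[int]) -> bool:
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            # Take j if it is free, or if its current partner can be re-matched.
            if match_right[j] == -1 or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    return sum(1 for i in range(len(left)) if try_augment(i, set()))


def group_similarity(g1: List[str], g2: List[str], theta: float = 0.5) -> float:
    """Jaccard-style group score: matched pairs over the union size (an assumption)."""
    if not g1 and not g2:
        return 1.0
    m = max_matching_size(g1, g2, theta)
    return m / (len(g1) + len(g2) - m)


if __name__ == "__main__":
    a = ["entity resolution survey", "graph matching methods"]
    b = ["Entity Resolution Survey", "graph matching method", "deep learning"]
    print(group_similarity(a, b))
```

Two groups whose score exceeds some cutoff would then be treated as candidate duplicates. A weighted matching, which keeps the element similarities instead of thresholding them, is an equally plausible variant.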

Keywords/Search Tags:

Data, Entity resolution

Related items

1	Research On Key Techniques Of Entity Resolution For Big Data Integration
2	Research On Entity Resolution Method Of Industrial Internet Of Things Data
3	A Research And Application On Entity Resolution
4	Research On The Method Of Entity Resolution In Big Data Environment
5	Research On Entity Resolution In Integrated Data
6	Entity Resolution Technology Research Based On Multi-Source Data
7	Research On Entity Resolution Towards Uncertain Data Stream And Resource Optimization
8	Entity Resolution For Web Data Integration
9	Study On Entity Resolution Based On Semantics In Data Integration
10	Design and construction of an entity resolution system that supports entity identity information management and asserted resolution