Toward Building End-to-End Entity Matching Solution

Posted on: 2019-03-17
Degree: Ph.D.
Type: Dissertation
University: The University of Wisconsin - Madison
Candidate: Gnanaprakash Christopher, Paul Suganthan
Full Text: PDF
GTID: 1478390017485287
Subject: Computer Science
Abstract/Summary:
Entity matching (EM) finds data records that refer to the same real-world entity. Numerous EM solutions have been proposed. These solutions, however, suffer from two main problems. First, they are not end-to-end: the EM workflow consists of multiple steps, such as cleaning, blocking, matching, sampling, labeling, and debugging, yet current work has focused mostly on blocking and matching and has ignored the remaining steps. Second, most current solutions are designed primarily for power users and are very difficult for lay users to use. In this dissertation, I develop solutions to address these two problems. For the first problem, I work together with several colleagues to develop Magellan, an end-to-end EM solution. Within the context of Magellan, I develop a solution that helps users extract missing attribute values from textual data, so that EM can be performed more accurately. For the second problem, I develop a solution that lay users can use to perform EM end-to-end easily on the cloud, using a cluster of machines and, optionally, crowdsourcing. I then focus on string matching, a special case of EM, and develop an effective end-to-end solution for lay users. Finally, I describe how these solutions have been implemented (mostly as open-source software) and deployed to solve real-world applications. In particular, the open-source implementations of several solutions have been deployed on Kaggle, a large and well-known data science and competition platform with well over 0.5 million users.
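To make the blocking and matching steps named in the abstract concrete, here is a minimal sketch in plain Python. It is not Magellan's API or the dissertation's method: the function names (`block`, `match`, `jaccard`), the toy tables, the choice of `city` as blocking key, token-level Jaccard similarity, and the 0.5 threshold are all assumptions made for illustration.

```python
def tokens(s):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(s.lower().split())

def jaccard(a, b):
    """Token-level Jaccard similarity (assumed here; any similarity would do)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def block(table_a, table_b, key):
    """Blocking: keep only record pairs that agree on the blocking key,
    so the matcher does not have to compare every pair."""
    index = {}
    for rb in table_b:
        index.setdefault(rb[key], []).append(rb)
    return [(ra, rb) for ra in table_a for rb in index.get(ra[key], [])]

def match(candidates, attr, threshold=0.5):
    """Matching: declare a pair a match if similarity on `attr` is high enough."""
    return [(ra["id"], rb["id"])
            for ra, rb in candidates
            if jaccard(ra[attr], rb[attr]) >= threshold]

# Hypothetical toy data, not taken from the dissertation.
table_a = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "Chicago"},
]
table_b = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Dave Smith Jr", "city": "Madison"},
    {"id": "b3", "name": "Joe Wilson", "city": "Austin"},
]

candidates = block(table_a, table_b, "city")
matches = match(candidates, "name")
```

In this toy run, the exact-match blocking key discards the pair (a2, b3) even though the names are identical, and the token-based matcher rejects (a1, b1) because "Dave" and "David" are different tokens. Errors of this kind are why the abstract treats steps such as cleaning, labeling, and debugging as part of the workflow rather than as optional extras.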
Keywords/Search Tags: Matching, Solution, End-to-end, Users, Data