
Entity resolution in structured records with machine learning methods

Posted on: 2013-04-13
Degree: Ph.D.
Type: Dissertation
University: State University of New York at Binghamton
Candidate: Shu, Liangcai
Full Text: PDF
GTID: 1458390008966855
Subject: Computer Science
Abstract/Summary:
In many Web, bibliographic, and business applications, there is a need to determine whether data objects in the same source or in different sources represent the same real-world entity. This problem arises in Web data integration, customer matching in supply chain management, citation matching, and user matching in social networks, whenever multiple data sources lack a shared unique identifier for a real-world entity. Entity resolution (ER) is the task of identifying the objects in data sets that refer to the same real-world entity. In this dissertation, we identify two types of ER problems in structured records and apply machine learning methods to solve them.

For the type I ER problem, we propose a generic framework, BARM, into which different blocking and matching algorithms can be fitted. Specifically, we focus on blocking algorithms and investigate the problem for large data sets, where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, SPectrAl Neighborhood (SPAN), which constructs a fast bipartition tree for the records based on spectral clustering, such that real entities can be identified accurately from neighboring records in the tree. We develop a fast algorithm that performs spectral clustering without explicitly computing pairwise similarities, which dramatically improves on the scalability of the standard spectral clustering algorithm.

For the type II ER problem, names (e.g., human names) represent the entities of interest. We identify two basic sub-problems: the name sharing problem and the name variant problem. We aim to solve both problems with a single model. Unlike previous work, our work uses global information about both words and names. We propose a generative latent topic model that involves both names and words, the LDA-dual model. We also propose an approach to learn the model and obtain the global information. Based on this global information, we propose two algorithms to solve the two sub-problems above.
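As a rough illustration of the blocking idea described for the type I problem, the sketch below recursively bipartitions records with a spectral split while never forming the record-by-record similarity matrix. It is not the dissertation's SPAN algorithm: the character-bigram features, the normalized similarity, the power iteration with deflation, and the names `bigram_counts`, `bipartition`, and `build_blocks` are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def bigram_counts(text):
    """Character-bigram counts of one record (an assumed feature choice)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bipartition(rows, iters=200):
    """Split records in two by the sign of an approximate second eigenvector.

    rows[i] maps feature -> weight (row i of a sparse matrix A). The
    similarity matrix S = A A^T is never built; each product with S is
    computed as A (A^T x).
    """
    def a_t(y):  # A^T y: record space -> feature space
        out = Counter()
        for row, yi in zip(rows, y):
            for f, w in row.items():
                out[f] += w * yi
        return out

    def a(x):  # A x: feature space -> record space
        return [sum(w * x[f] for f, w in row.items()) for row in rows]

    n = len(rows)
    deg = [max(d, 1e-12) for d in a(a_t([1.0] * n))]  # row sums of S
    root = [math.sqrt(d) for d in deg]
    norm = math.sqrt(sum(deg))
    v1 = [r / norm for r in root]  # top eigenvector of D^-1/2 S D^-1/2

    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    x = [rng.random() for _ in range(n)]
    for _ in range(iters):
        dot = sum(xi * vi for xi, vi in zip(x, v1))
        x = [xi - dot * vi for xi, vi in zip(x, v1)]  # remove v1 component
        y = a(a_t([xi / r for xi, r in zip(x, root)]))
        x = [yi / r for yi, r in zip(y, root)]
        size = math.sqrt(sum(xi * xi for xi in x)) or 1.0
        x = [xi / size for xi in x]

    left = [i for i, xi in enumerate(x) if xi >= 0]
    right = [i for i, xi in enumerate(x) if xi < 0]
    return left, right


def build_blocks(records, max_size):
    """Recursively bipartition until every block has at most max_size records."""
    rows = [bigram_counts(r) for r in records]

    def split(ids):
        if len(ids) <= max_size:
            return [ids]
        left, right = bipartition([rows[i] for i in ids])
        if not left or not right:  # no useful split found; stop here
            return [ids]
        return split([ids[i] for i in left]) + split([ids[i] for i in right])

    return split(list(range(len(records))))


records = ["john smith", "jon smith", "john smyth",
           "mary williams", "marie williams", "mary wiliams"]
blocks = build_blocks(records, max_size=3)
print(blocks)
```

On this toy input the two name groups share no bigrams, so the split places the three Smith-like records in one block and the three Williams-like records in the other. Only records within the same block would then be passed to a matching step, which is what keeps the number of pairwise comparisons small.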
Keywords/Search Tags:Entity, Global information, Records, Problem, Propose, Data, Identify, Model