Entity resolution in structured records with machine learning methods

Posted on:2013-04-13

Degree:Ph.D

Type:Dissertation

University:State University of New York at Binghamton

Candidate:Shu, Liangcai

GTID:1458390008966855

Subject:Computer Science

Abstract/Summary:

In many Web, bibliographic and business applications, there is a need to determine whether data objects in the same source or in different sources represent the same real-world entity. This problem arises in Web data integration, customer matching in supply chain management, citation matching and user matching in social networks, whenever no unique identifier exists across multiple data sources to represent a real-world entity. Entity resolution (ER) is the task of identifying objects in data sets that refer to the same real-world entity. In this dissertation, we identify two types of ER problems in structured records and apply machine learning methods to solve them.

For the type I ER problem, we propose a generic framework, BARM, into which different blocking and matching algorithms can be fitted. Specifically, we focus on blocking algorithms and investigate the problem for large data sets, where efficient and scalable solutions are needed. We propose a novel unsupervised blocking algorithm, SPectrAl Neighborhood (SPAN), which uses spectral clustering to construct a fast bipartition tree over the records, such that real entities can be identified accurately from neighboring records in the tree. We develop a fast algorithm that performs spectral clustering without computing pairwise similarities explicitly, which dramatically improves on the scalability of the standard spectral clustering algorithm.

For the type II ER problem, names (e.g., human names) represent the entities of interest. We identify two basic sub-problems: the name sharing problem and the name variant problem. We aim to solve both problems with a single model. Unlike previous work, our work uses global information about both words and names. We propose a generative latent topic model that involves both names and words, the LDA-dual model. We also propose an approach to learn the model and obtain global information. Based on this global information, we propose two algorithms to solve the problems mentioned above.
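To make the idea of spectral clustering without explicit pairwise similarities more concrete, the following is a minimal sketch of a single bipartition step. It is not the SPAN algorithm from the dissertation, whose details are not given in this abstract. It assumes that each record is a non-negative feature vector (for example, token counts) and that similarity is the dot product of those vectors, so the n x n similarity matrix S = A A^T never has to be materialized. The function name `spectral_bipartition`, the parameter `features` and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np


def spectral_bipartition(features):
    """Split records into two groups using a normalized spectral cut.

    Assumption (not from the dissertation): similarity is S = A @ A.T for a
    non-negative feature matrix A of shape (n, d), with n >= 2 and d >= 2.
    The n x n matrix S is never formed.
    """
    A = np.asarray(features, dtype=float)
    # Degrees of the implicit similarity graph: S @ 1 = A @ (A.T @ 1).
    degrees = A @ (A.T @ np.ones(A.shape[0]))
    if np.any(degrees <= 0):
        raise ValueError("every record needs at least one non-zero feature")
    # B @ B.T equals D^{-1/2} S D^{-1/2}, so the left singular vectors of B
    # are the eigenvectors of the normalized similarity matrix.
    B = A / np.sqrt(degrees)[:, None]
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    # The second singular vector, rescaled by D^{-1/2}, is the relaxed
    # normalized-cut indicator; its sign assigns each record to a side.
    indicator = U[:, 1] / np.sqrt(degrees)
    return indicator >= 0


# Toy data: rows are records, columns are token counts. Records 0-1 share
# tokens, records 2-3 share tokens, and the last column is a token common
# to all records, which keeps the implicit graph connected.
records = np.array([
    [3, 1, 0, 0, 1],
    [1, 3, 0, 0, 1],
    [0, 0, 3, 1, 1],
    [0, 0, 1, 3, 1],
])
labels = spectral_bipartition(records)
```

Which side is labeled `True` depends on the sign of the singular vector and is arbitrary; only the grouping is meaningful. Applying the same split recursively to each side would yield a bipartition tree over the records, although how SPAN builds and searches its tree is not specified here.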

Keywords/Search Tags:

Entity, Global information, Records, Problem, Propose, Data, Identify, Model

Related items

1	Statistical Model Based Chinese Named Entity Recognition Methods And Its Application To Medical Records
2	Multi-sensor vegetation index and land surface phenology earth science data records in support of global change studies: Data quality challenges and data explorer system
3	Research On Duplicate Records Identification Model In Deep Web
4	The Research And Implementation Of Entity Identification Subsystem In The Data Management System Of Quality And Quantity
5	Research On Efficient Entity Resolution On Heterogeneous Records
6	Design And Implementation Of Medical Records Writing Assistant System Based On Named Entity Recognition
7	An Entity Resolution Approach Based On Attributes Weights And Marked Records
8	Development Of Computational Methods For Extracting Information From Chinese Electronic Medical Records
9	Global-Scale Data Management with Strong Consistency Guarantees
10	Research On Entity Linking Algorithm By Combining The Attention Mechanism And Hidden Semantic Information