Font Size: a A A

Research On Key Technologies Of Entity Resolution For Structured Data

Posted on:2018-04-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:C C SunFull Text:PDF
GTID:1368330572965451Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Entity resolution(ER)is a key aspect of data quality,which is very important to data integration and data mining.This thesis focuses on entity resolution for structured data.Data integration and data mining probably involve multiple datasources,and different datasources describes entities in different ways.The data objects describing the same real world entity are not semantically the same,due to spelling mistakes,different abbreviations,different description formats,attribute values missing,entities' partial attribute values evolving with time(such as age,home address and affiliation)and so on.The entity resolution task is to identify data objects corresponding to the same real world entity in one or multiple datasources.Entity resolution first appeared in social public services such as national census and health sector.Public institutes paid attentions to and relied upon ER long time ago that facilitates ER researches.ER holds a history of several decades,during which many effective ER technologies were proposed.However,existing works do have weaknesses,such as fowllong four aspects:(1)Existing ER approaches either require domain experts to design match rules,or need a large amount of manually labeled training data to learn match rules(also called ER classifiers).(2)In related data of multiple types,entity relationships may promote ER.But existing works cannot carry out the three tasks at the same time:(a)resolving data objects with attribute values missing,(b)resolving data objects corresponding to different entities but with the same names(usually person names),and(c)using entity relationships to improve ER accuracies and optimize resolution orders.(3)Existing works usually use generic clustering algorithms to solve the match decision problems in unsupervised ER,which results in limited ER accuracies because generic clustering algorithms do not take special features of ER into account.(4)In the face of progressive ER requirements,existing progressive ER approaches require optimal blocking or sorting keys,and cannot directly select data in the most redundant regions of dirty datasets.To address the four weaknesses above,the main contributions of the thesis are as following:(1)For supervised ER,we propose a supervised ER approach with genetic algorithm and active learning,which generates effective ER classifiers with a small amount of manually labeled data.The proposed approach learns ER classifiers via genetic evolution,which includes population initialization,individual selection,individual reproduction,gene crossover and mutation.In particular.we propose a set of special crossovers,each of which targets a specific aspect of ER classifiers.Similarity function crossover is responsible for selecting appropriate similarity functions for attribute comparisons;threshold crossover is responsible for setting a proper threshold for each attribute comparison;aggregation crossover is responsible for logically combining multiple attribute comparisons in an effective way.The three special crossovers help generate effective ER classifiers.Meanwhile,the proposed ER approach reduces overloads of manually labeled data via active learning,promoting learning speeds.Experiments over several datasets verify that the proposed approach outperforms existing approaches in accuracy,requiring much fewer manually labeled training data.(2)For joint ER.we propose a graph based iterative joint ER approach to resolve related data of multiple types.At first,build an object relationship graph with a related dataset of multiple types,and combine semantic path based similarity and attribute similarity to decide whether two data objects match.Then,merge matched data objects,and execute local graph contraction to corresponding data object vertexes and their neighborhoods in the object graph.The two operations enrich local semantics of the object graph,and help produce new candidate data object pairs in these neighborhoods for further resolution.In such a way,similarity propagates,and the approach resolves data objects iteratively.With the iterative process,the object graph's semantic becomes richer and richer,which promotes the accuracy of joint ER.The experiments prove that the proposed joint ER approach outperforms existing joint ER approaches and object relationship based single type ER approaches in accuracy(3)For unsupervised ER,we propose a graph clustering based unsupervised ER approach.At first,construct a weighted data object similarity graph with data objects and their similarities.Then,the basic clustering logic is that a cluster iteratively absorbs its nearest neighbor vertex.During clustering,use random walk with restarts on the similarity graph to dynamically compute the similarity between a cluster and a vertex.We propose to sort data objects to optimize clustering order,which promotes the clustering accuracy.We also propose an improved computation method for stationary probability distributions of random walk,which reduces clustering cost.Experiments over real datasets and synthetic datasets verify effectiveness of the proposed graph clustering algorithm in unsupervised ER match decisions(4)For progressive data management requirements,we propose a progressive ER approach based on muti-blocking.The proposed ER approach does not require optimal blocking or sorting keys,and is able to find the most redundant regions in dirty datasets.The approach consists of two stages:the initialization stage and the iterative stage.In the initialization stage,generate candidate data object pairs,and sort them according to match probabilities in a candidate queue.In the iterative stage,each time choose the front candidate pair(the most matchable pair)of the candidate queue to process;dynamically update candidate pairs' match probabilities according to the real-time ER result,and then update the candidate queue.In such a way,the proposed ER approach reduces useless data object comparisons,and optimizes the real-time ER result.Experimentally comparisons over real datasets and synthetic datasets prove that the proposed progressive ER approach based on muti-blocking improves existing works.
Keywords/Search Tags:entity resolution, data quality, structured data, genetic algorithm, joint entity resolution, graph clustering, progressive entity resolution
PDF Full Text Request
Related items