Research On Efficient Entity Resolution On Heterogeneous Records

Posted on:2018-04-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y M Lin

Full Text:PDF

GTID:2348330536981904

Subject:Computer Science and Technology

Abstract/Summary:


With the development of technology, we have seen an explosion of data, especially in computer-based and web-based applications, which makes a large amount of heterogeneous data available. However, this heterogeneity prevents people from using the data effectively to create value. Hence, it is critical to clean heterogeneous data, and entity resolution (ER) is one fundamental step. ER is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each other. To make better use of such records, most existing work assumes that schema matching and data exchange have already been done to convert records under different schemas into records under a predefined schema. However, we observe that schema matching loses information in some cases, and that this information could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we propose HERA (Heterogeneous Entity Resolution Algorithm). To begin with, we address two key challenges: description difference and heterogeneous schema. Furthermore, we show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework that iteratively finds records referring to the same entity. For similarity computation, the core problem of ER, we present instance-based and schema-based algorithms that compute record similarity without a priori knowledge of the schema matching between heterogeneous records. Regarding efficiency, we build an effective index to accelerate HERA and, based on this index, develop a set of optimization techniques: for each candidate record pair, we compute tight upper and lower bounds on its similarity to refine the candidate sets, and we design a graph pruning technique to accelerate similarity computation. Finally, evaluations on real-world datasets show the effectiveness and efficiency of our methods.
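To make the idea of comparing records without schema matching more concrete, the following is a minimal sketch and not the HERA algorithm itself, whose similarity function, index, and bounds are not given in this abstract. It assumes records are flat dictionaries, uses token-level Jaccard similarity between attribute values, pairs values greedily regardless of attribute names, and groups records with a single pass of union-find rather than the iterative framework described above. The function names, the normalisation by the larger record, and the threshold of 0.5 are all illustrative assumptions.

```python
from itertools import combinations


def tokens(value):
    """Lowercased whitespace-separated tokens of one attribute value."""
    return set(str(value).lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def record_similarity(r1, r2):
    """Instance-based similarity that ignores attribute names (assumed design).

    Values are paired one-to-one, greedily by descending Jaccard score,
    and the summed score is normalised by the larger number of values.
    """
    v1 = [tokens(v) for v in r1.values()]
    v2 = [tokens(v) for v in r2.values()]
    if not v1 or not v2:
        return 0.0
    scored = sorted(
        ((jaccard(a, b), i, j) for i, a in enumerate(v1) for j, b in enumerate(v2)),
        reverse=True,
    )
    used1, used2, total = set(), set(), 0.0
    for score, i, j in scored:
        if score == 0.0:
            break  # remaining pairs share no tokens
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += score
    return total / max(len(v1), len(v2))


def resolve(records, threshold=0.5):
    """Group record indices whose pairwise similarity reaches the threshold."""
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if record_similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical records under three different schemas.
records = [
    {"name": "John Smith", "city": "Boston"},
    {"full_name": "John Smith", "location": "Boston", "phone": "555-0100"},
    {"title": "Alice Wong", "town": "Seattle"},
]
print(resolve(records))
```

This sketch compares every pair of records, which is quadratic in the number of records; the index, similarity bounds, and graph pruning mentioned in the abstract are aimed at avoiding exactly that cost and are not reproduced here.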

Keywords/Search Tags:

Entity resolution, data cleaning, heterogeneous records


Related items

1	Entity resolution in structured records with machine learning methods
2	Goal-Based Entity Resolution
3	The Research And Application Of Duplicated Records And Incomplete Data's Cleaning Approach
4	Research On The Method Of Entity Resolution In Big Data Environment
5	Data cleaning techniques by means of entity resolution
6	Effective Rule-based Algorithms For Data Cleaning
7	Entity Resolution Technology Research Based On Multi-Source Data
8	Research On Key Technologies Of Entity Resolution For Structured Data
9	An Entity Resolution Approach Based On Attributes Weights And Marked Records
10	Research On Data Cleaning Using Web Information