
Research On Efficient Entity Resolution On Heterogeneous Records

Posted on: 2018-04-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y M Lin
GTID: 2348330536981904
Subject: Computer Science and Technology
Abstract/Summary:
With the development of technology, we have seen an explosion of data, especially in computer- and web-based applications, which has made large amounts of heterogeneous data available. However, this heterogeneity prevents people from using the data effectively to create value. Hence, it is critical to clean heterogeneous data, and entity resolution (ER) is one fundamental step. ER is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each other. To leverage such records better, most existing work assumes that schema matching and data exchange have already been done to convert records under different schemas to records under a predefined schema. However, we observe that schema matching can lose information in some cases, and that this information could be useful or even crucial to ER. To leverage sufficient information from heterogeneous sources, in this paper we propose HERA (Heterogeneous Entity Resolution Algorithm). To begin with, we address two key challenges: description difference and heterogeneous schema. Furthermore, we show that none of the existing similarity metrics or their transformations can be applied to find similar records under heterogeneous settings. Motivated by this, we design a similarity function and propose a novel framework to iteratively find records that refer to the same entity. For the core problem of ER, similarity computation, we present instance-based and schema-based algorithms to compute record similarity without a priori knowledge of the schema matching between heterogeneous records. Regarding efficiency, we build an effective index to accelerate HERA; based on this index, we develop a set of optimization techniques as follows. For each candidate record pair, we compute tight upper and lower bounds to refine the candidate sets; we also design a graph pruning technique to accelerate similarity computation. Finally, evaluations on real-world datasets show the effectiveness and efficiency of our methods.
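To make the idea of instance-based similarity concrete, the following is a minimal sketch of how two records with different schemas might be compared using only their values, without any attribute mapping. It is an illustration under stated assumptions and not HERA's actual similarity function, which the abstract does not specify: the token-level Jaccard measure, the greedy one-to-one value matching, the `value_threshold` parameter, the normalization by the smaller record, and all function names are hypothetical choices.

```python
def tokens(value):
    """Lowercased whitespace tokens of an attribute value (assumed tokenization)."""
    return set(str(value).lower().split())


def jaccard(a, b):
    """Token-level Jaccard similarity between two attribute values."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1, r2, value_threshold=0.5):
    """Compare two records (dicts) whose attribute names may not correspond.

    Every value of r1 is scored against every value of r2, and the best
    pairs are matched greedily one-to-one. Attribute names are used only
    as identifiers, never compared, so no schema matching is required.
    """
    if not r1 or not r2:
        return 0.0
    # All cross-record value pairs, best first.
    pairs = sorted(
        ((jaccard(v1, v2), k1, k2)
         for k1, v1 in r1.items()
         for k2, v2 in r2.items()),
        reverse=True,
    )
    used1, used2 = set(), set()
    total = 0.0
    for sim, k1, k2 in pairs:
        if sim < value_threshold:
            break  # remaining pairs are even less similar
        if k1 in used1 or k2 in used2:
            continue  # each attribute is matched at most once
        used1.add(k1)
        used2.add(k2)
        total += sim
    # Normalize by the smaller record so extra attributes do not penalize a match
    # (an assumed design choice).
    return total / min(len(r1), len(r2))
```

For example, a record `{"name": "John Smith", "city": "New York"}` and a record `{"full_name": "john smith", "location": "New York", "phone": "555-1234"}` share no attribute names, yet their values align, so the sketch scores them as highly similar. An exhaustive pairwise comparison like this is quadratic in the number of attributes, which is the kind of cost that the index, bounds, and pruning described in the abstract aim to reduce.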
Keywords/Search Tags: Entity resolution, data cleaning, heterogeneous records