
Entity Matching Based On Polymorphic Non-Key Attributes

Posted on: 2018-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yang
GTID: 2348330542465186
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information technology, data from different fields is growing explosively, which makes data quality problems such as data distortion, stale data, missing data, and inconsistent data representation increasingly prominent. This thesis studies one important data quality topic, Entity Matching (EM for short). EM aims to identify records that refer to the same entity within or across databases. So far, most existing EM algorithms rely on string similarity metrics to measure the similarity between the key attribute values of entities and then make a decision according to a predefined similarity threshold. An arbitrarily chosen threshold, however, hurts either precision or recall: set too high, it misses true matches; set too low, it admits false ones.

To address the problems of the existing methods, we propose entity matching algorithms based on polymorphic non-key attributes, which analyse textual data and compute similarity to perform EM. Our methods are orthogonal to the existing EM methods based on key attributes. We focus on how to use non-key attributes to improve the precision and recall of EM that relies on key attributes only. The main contributions are as follows:

(1) We focus on the problem of EM. We introduce several existing EM methods and analyse their advantages and disadvantages.
(2) We propose non-key-attribute-based EM algorithms, which select non-key attributes for EM according to their identification ability. The proposed methods address both the problem of differing representations and the problem of missing values.
(3) We propose textual-data-based EM algorithms, which mine key information from textual data. The precision and recall of EM are greatly improved with the proposed methods.

We demonstrate the effectiveness and applicability of the proposed methods on real-world datasets. Our empirical study shows that the proposed EM methods outperform state-of-the-art EM methods, achieving higher precision and recall. The efficiency of EM is also greatly improved by the proposed blocking algorithm, which reduces the number of comparisons.
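
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes a token-level Jaccard similarity as a stand-in metric, a first-letter blocking key, and hand-picked weights standing in for each non-key attribute's "identification ability"; the function names, thresholds, and sample records are all hypothetical. Pairs whose key-attribute similarity falls in an uncertain band are decided by the non-key attributes that both records actually have, so missing values are skipped rather than penalised.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1] (a stand-in metric, an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_entities(records, key_attr, non_key_weights, block_key,
                   hi=0.9, lo=0.5, non_key_min=0.8):
    """Return (matched id pairs, number of pairwise comparisons performed)."""
    # Blocking: only records that share a block key are compared.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[block_key(rec)].append(rid)

    matches, compared = set(), 0
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            compared += 1
            ra, rb = records[a], records[b]
            key_sim = jaccard(ra[key_attr], rb[key_attr])
            if key_sim >= hi:
                matches.add((a, b))  # key attribute alone is decisive
            elif key_sim >= lo:
                # Uncertain band: consult non-key attributes present in both records.
                shared = [(attr, w) for attr, w in non_key_weights.items()
                          if ra.get(attr) and rb.get(attr)]
                total = sum(w for _, w in shared)
                if total > 0:
                    score = sum(w * jaccard(ra[attr], rb[attr])
                                for attr, w in shared) / total
                    if score >= non_key_min:
                        matches.add((a, b))
    return matches, compared


# Hypothetical sample data; None marks a missing value.
records = {
    1: {"name": "john smith", "city": "new york", "phone": "555 0101"},
    2: {"name": "john a smith", "city": "new york", "phone": "555 0101"},
    3: {"name": "john smith jr", "city": "boston", "phone": None},
    4: {"name": "mary jones", "city": "chicago", "phone": None},
    5: {"name": "mary jones", "city": None, "phone": "555 0199"},
}

matches, compared = match_entities(
    records,
    key_attr="name",
    non_key_weights={"city": 1.0, "phone": 2.0},  # assumed weights
    block_key=lambda rec: rec["name"][0],
)
print(sorted(matches), compared)
```

On this sample, records 1 and 2 have only moderately similar names but agree on both non-key attributes, so they are matched, while record 3 is rejected because its city differs. Blocking reduces the work from 10 all-pairs comparisons to 4.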
Keywords/Search Tags: Data Quality, Entity Matching, Polymorphic Non-key Attributes Data, Accuracy, Efficiency