Study On Entity Resolution Based On Semantics In Data Integration

Posted on: 2013-06-27
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Zhang
GTID: 2298330467974708
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, data integration is becoming increasingly important. Data integration combines multiple heterogeneous data sources to provide users with a more convenient, unified service. In integrated data, many records often refer to the same entity. The process of finding such records is called entity resolution, and it is a critical step in data integration.

To improve the efficiency of entity resolution, some research has proposed reducing the number of matches between records based on the ICAR properties (idempotence, commutativity, associativity, representativity). However, the representativity property is hard to satisfy fully. Moreover, merging all similar records that the resolution algorithm attributes to the same entity is unreasonable and difficult to apply in practice.

To improve the accuracy of entity resolution, it should be noted that ownership relationships, interactive relationships, or chronological semantic relationships may exist among records, and exploiting them can substantially improve resolution accuracy. Yet hardly any study combines semantic relationships with temporal order in entity resolution. Therefore, we study entity resolution based on multiple semantic associations.

In this thesis, we first define the concept of semantic coverage, which lies between the representative and non-representative cases of the ICAR properties, as an optimization. The coverage property not only reduces the number of matches between records but also makes the resolution result more reasonable. On the basis of the coverage property, we propose the C-Swoosh and C-SNW algorithms. C-Swoosh merges similar records that satisfy the coverage property and replaces the original records with the merged result, without considering record order. C-SNW first sorts the records by a chosen key value, then uses a sliding window to compare similar records as early as possible and merges the records that qualify, thereby reducing the number of matches.

Second, a combination of ownership relationships, interactive relationships, and chronological semantic relationships is used to improve the accuracy of entity resolution. In this way, the effect of time evolution on an entity is captured through continuous iteration, which enhances the accuracy of entity resolution.

Finally, experimental results are given to verify the effectiveness of the proposed algorithms.
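The idea of sorting records by a key and comparing them only inside a sliding window can be illustrated with a short sketch. This is not the C-SNW algorithm from the thesis, whose details the abstract does not give; it is a generic sorted-neighborhood pass under assumed definitions. The record layout, the `similar` rule, the `covers` test (one record covers another when it already contains all of the other's attribute values), and the `sort_key` function are all hypothetical names chosen for illustration.

```python
# Hypothetical record layout: attribute name -> set of observed values.
Record = dict[str, frozenset[str]]


def similar(a: Record, b: Record) -> bool:
    """Assumed match rule: the two records share at least one name value."""
    return bool(a.get("name", frozenset()) & b.get("name", frozenset()))


def covers(a: Record, b: Record) -> bool:
    """Assumed coverage test: every value of b already appears in a."""
    return all(values <= a.get(attr, frozenset()) for attr, values in b.items())


def sort_key(r: Record) -> str:
    """Assumed sorting key: the smallest name value ('' if there is none)."""
    return min(r.get("name") or {""})


def windowed_resolve(records: list[Record], window: int = 3) -> tuple[list[Record], int]:
    """Sort by key, then compare each record only with the last
    `window - 1` kept records. A record covered by a similar kept record
    is dropped; a record that covers a similar kept record replaces it.
    Similar records with no coverage relation are both kept.
    Returns the kept records and the number of comparisons made."""
    kept: list[Record] = []
    comparisons = 0
    for rec in sorted(records, key=sort_key):
        absorbed = False
        start = max(0, len(kept) - (window - 1))
        for i in range(start, len(kept)):
            comparisons += 1
            if not similar(kept[i], rec):
                continue
            if covers(kept[i], rec):
                absorbed = True  # rec adds nothing new
                break
            if covers(rec, kept[i]):
                kept[i] = rec  # rec supersedes the kept record
                absorbed = True
                break
        if not absorbed:
            kept.append(rec)
    return kept, comparisons


# Made-up example data.
r1 = {"name": frozenset({"c zhang"}), "city": frozenset({"beijing"})}
r2 = {"name": frozenset({"c zhang"})}
r3 = {"name": frozenset({"a li"})}
r4 = {"name": frozenset({"c zhang"}), "city": frozenset({"shanghai"})}
resolved, n_comparisons = windowed_resolve([r1, r2, r3, r4], window=3)
```

In this sketch, a record that is similar to a kept record but neither covers it nor is covered by it stays separate, which reflects the abstract's point that merging every pair of similar records is not always reasonable. How the thesis actually merges such records is not stated in the abstract.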
Keywords/Search Tags: Data integration, entity resolution, ICAR, coverage property, semantic relationship