
An Entity Resolution Approach Based On Attributes Weights And Marked Records

Posted on: 2014-02-07
Degree: Master
Type: Thesis
Country: China
Candidate: L M Zhen
Full Text: PDF
GTID: 2348330473451112
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, data volumes are growing ever faster, and entity resolution has become increasingly important. Entity resolution is the process of identifying and merging tuples that refer to the same real-world entity, whether they come from one data source or from several. Because spelling and typographical errors can occur when data is stored, the same entity may appear in different forms, and records often lack a unique identifier. This makes entity resolution a non-trivial problem.

Identifying records that refer to the same entity efficiently and accurately has long been a goal of researchers. Most rule-based matching algorithms use all attributes as matching attributes and assign every attribute the same weight. Equal weighting does not reflect the importance of key attributes, so errors arise easily during entity resolution. In addition, many approaches do not process records after they have been identified as matches, so the same records are matched redundantly, which slows entity resolution down. This thesis therefore proposes an entity resolution approach based on attribute weights and marked records to improve both the accuracy and the efficiency of entity resolution.

First, this thesis addresses the accuracy of entity resolution in a relational database and proposes an entity resolution approach based on attribute weights. The approach mainly uses information gain and probability statistics to calculate attribute weights, which represent the importance of each attribute in a record, and combines them with a top-k technique, with the aim of improving accuracy and reducing running time.
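The abstract does not give the weighting formulas, so the following is only a minimal sketch of how information gain could be turned into attribute weights, with a top-k selection on top. It assumes a small set of labelled record pairs in which each attribute is reduced to a boolean "values agree" flag; the function names, the normalisation step, and the sample data are all hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on feature_values."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def attribute_weights(pairs, labels, attributes):
    """Normalise per-attribute information gain so the weights sum to 1.

    Assumption: each pair maps an attribute name to True/False, meaning the
    two records agree or disagree on that attribute.
    """
    gains = {a: information_gain([p[a] for p in pairs], labels)
             for a in attributes}
    total = sum(gains.values())
    if total == 0:
        # No attribute is informative: fall back to equal weights.
        return {a: 1 / len(attributes) for a in attributes}
    return {a: g / total for a, g in gains.items()}


def top_k_attributes(weights, k):
    """Return the k attributes with the largest weights."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Hypothetical labelled pairs: 1 = same entity, 0 = different entities.
pairs = [
    {"name": True, "email": True, "city": True},
    {"name": True, "email": True, "city": False},
    {"name": False, "email": False, "city": True},
    {"name": True, "email": False, "city": False},
]
labels = [1, 1, 0, 0]

weights = attribute_weights(pairs, labels, ["name", "email", "city"])
selected = top_k_attributes(weights, 2)
print(weights)
print(selected)
```

In this toy data the `email` flag separates matches from non-matches perfectly, while `city` carries no information, so `email` receives the largest weight and `city` drops out of the top-2 selection.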
On this basis, a top-k technique selects the best matching attributes and reduces the number of attributes that have to be compared, which speeds up entity resolution.

Second, to improve the efficiency of entity resolution, this thesis proposes a merge algorithm based on marked records. The algorithm merges records that refer to the same entity and marks every record that took part in a merge. Marked records are not matched again, which reduces the number of record comparisons and improves the efficiency of entity resolution.

Finally, experimental results on real data are presented to verify the usefulness and effectiveness of the proposed method.
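The idea of merging matched records and marking them so that they are not compared again can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm from the thesis: the exact-equality similarity, the merge rule, the threshold, and the sample records are all hypothetical.

```python
def weighted_similarity(a, b, weights):
    """Sum of the weights of attributes on which both records agree exactly.

    Assumption: exact equality stands in for whatever per-attribute
    similarity measure is actually used; missing values never count as equal.
    """
    return sum(w for attr, w in weights.items()
               if a.get(attr) is not None and a.get(attr) == b.get(attr))


def merge_records(a, b):
    """Combine two records, keeping a's values and filling a's gaps from b."""
    merged = dict(b)
    merged.update({k: v for k, v in a.items() if v is not None})
    return merged


def resolve(records, weights, threshold):
    """Merge matching records, marking each merged record to skip it later."""
    marked = [False] * len(records)
    entities = []
    for i, record in enumerate(records):
        if marked[i]:
            continue  # already absorbed into an earlier entity
        marked[i] = True
        current = record
        for j in range(i + 1, len(records)):
            if marked[j]:
                continue  # marked records are never compared again
            if weighted_similarity(current, records[j], weights) >= threshold:
                current = merge_records(current, records[j])
                marked[j] = True
        entities.append(current)
    return entities


# Hypothetical records and weights.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "city": None},
    {"name": "Bob Ray", "email": "bob@example.com", "city": "Oslo"},
    {"name": "Ann Lee", "email": "ann@example.com", "city": "Paris"},
    {"name": "A. Lee", "email": "ann@example.com", "city": "Paris"},
]
weights = {"email": 0.6, "name": 0.3, "city": 0.1}

entities = resolve(records, weights, threshold=0.6)
print(entities)
```

Because the third and fourth records are marked while the first record is being processed, the outer loop skips them later, so they are never compared against the second record or against each other.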
Keywords/Search Tags: entity resolution, attributes weights, information gain, probability statistics, marked records