
Research On Entity Resolution In Integrated Data

Posted on: 2011-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhang
Full Text: PDF
GTID: 2178360305450264
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of networks and information, people enjoy great convenience: they publish information online, search for it, and do research on the web. A wide variety of applications and web services hold large data resources. To use these existing resources effectively, data integration or data mining is needed. However, because data comes from many sources and is updated quickly, much of it may be out of date for various reasons, and integrated data contains a great deal of "dirty data", that is, data quality problems. These problems mainly include spelling mistakes, input errors, illegal values, null values, inconsistent values, use of abbreviations, differing naming conventions, duplicates, and so on. Because data is represented differently in different databases, a single entity may have two or more representations. Such duplicate records can lead to incorrect data mining models and, in turn, to faulty decision analysis. It is therefore particularly important to detect duplicate records in data warehousing and data integration. To improve the reliability and availability of integrated data, duplicates must be detected and merged. This problem is called entity resolution, and it poses a new challenge for researchers in the domains of data integration and data warehousing.

The goal of entity resolution is to reconcile data references that correspond to the same real-world entity. It is a critical component of data integration and data cleaning. To address the problem above, this paper proposes two methods: a field-independent method based on weighted grading for entity resolution, and a collective entity resolution method using a Quasi-Clique similarity measure. The new ideas in these methods are as follows:

1. Following the idea of grouping, a key field, or some words of that field, is chosen to divide the large data set into many non-intersecting small data sets; approximately duplicated records are then detected and eliminated within each small set, and these steps are repeated with other key fields or words (see the first sketch after this abstract). Experiments show that this algorithm achieves both good detection precision and good time efficiency.

2. In many domains, some underlying entities have strong ties to certain other entities. For instance, people often interact with their close friends in a social network, while in the bibliography domain, researchers with closely related interests form relatively stable communities in which they collaborate frequently. The compactness of such a community can be expressed by a kind of graph: the Quasi-Clique.

3. This paper proposes a collective entity resolution method that combines three measures: attribute-based similarity, context-based similarity, and Quasi-Clique similarity (see the second sketch below). In particular, relationship similarity is measured using Quasi-Cliques, which effectively reduces false positives and improves the accuracy of entity resolution. Experimental evaluation on data sets shows that the method achieves high precision and good efficiency.
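The abstract does not give the grouping algorithm of point 1 in code, but the multi-pass blocking idea can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the thesis's exact weighted grading scheme: the function names, the first-token blocking key, the field weights, and the 0.85 match threshold are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Normalized string similarity between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities (a stand-in for
    the thesis's weighted grading of fields)."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def blocked_dedup(records, key_fields, weights, threshold=0.85):
    """Multi-pass blocking: one pass per key field, comparing records
    only inside each small block, then repeating with the next key field."""
    duplicates = set()
    for key in key_fields:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            # Block on the first token of the key field (one simple choice
            # of "some words of the field").
            tokens = rec.get(key, "").split()
            blocks[tokens[0].lower() if tokens else ""].append(idx)
        for ids in blocks.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    if record_similarity(records[ids[i]], records[ids[j]], weights) >= threshold:
                        duplicates.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return duplicates
```

Because each pass compares records only within a block, the quadratic pairwise comparison is confined to small sets, which is where the time-efficiency claim in point 1 comes from.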
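Similarly, the combined measure in point 3 might look like the following sketch, assuming an undirected reference graph (built with networkx), a caller-supplied attribute-similarity function, and illustrative weights. The γ-quasi-clique density used here is one plausible reading of the Quasi-Clique measure, not the thesis's exact definition.

```python
import networkx as nx  # assumed graph library; any adjacency structure would do


def quasi_clique_density(G, nodes, gamma=0.6):
    """Return 1.0 if the induced subgraph is a gamma-quasi-clique
    (every vertex adjacent to at least gamma * (n - 1) of the others),
    otherwise its edge density as a softer score."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    if n < 2:
        return 0.0
    min_deg = min(d for _, d in sub.degree())
    if min_deg >= gamma * (n - 1):
        return 1.0
    return sub.number_of_edges() / (n * (n - 1) / 2)


def combined_similarity(G, u, v, attr_sim, w_attr=0.5, w_ctx=0.3, w_qc=0.2):
    """Weighted combination of attribute-based, context-based
    (shared-neighbour Jaccard), and Quasi-Clique similarity for a
    candidate reference pair (u, v)."""
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    context = len(nu & nv) / len(nu | nv) if (nu | nv) else 0.0
    qc = quasi_clique_density(G, (nu | nv) | {u, v})
    return w_attr * attr_sim(u, v) + w_ctx * context + w_qc * qc
```

In a collective setting, pairs whose combined score exceeds a threshold would be merged and the graph updated, so that each merge can strengthen or weaken the evidence for the remaining candidate pairs.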
Keywords/Search Tags: data integration, data warehouse, entity resolution, clustering, Quasi-Clique