
Research On Entity Resolution In Integrated Data

Posted on: 2011-06-29
Degree: Master
Type: Thesis
Country: China
Candidate: X J Zhang
Full Text: PDF
GTID: 2178360305450264
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of networks and information, people enjoy great convenience: they publish information online, search for it, and do research on the web. A wide variety of applications and web services hold large data resources. To use these existing resources effectively, data integration or data mining is needed. However, because data comes from many sources and is updated quickly, much of it may be out of date for various reasons, and integrated data contains a great deal of "dirty data", that is, data quality problems. These problems mainly include spelling mistakes, input errors, illegal values, null values, inconsistent values, use of abbreviations, differing naming conventions, duplicates, and so on. Because data is represented differently in different databases, a single entity may have two or more representations. Such duplicate records can lead to incorrect data mining models and, in turn, to faulty decision analysis. It is therefore particularly important to detect duplicate records in data warehousing and data integration. To improve the reliability and availability of integrated data, duplicates must be detected and merged. This problem is called entity resolution, and it poses a new challenge for researchers in the domains of data integration and data warehousing.

The goal of entity resolution is to reconcile data references that correspond to the same real-world entity. It is a critical component of data integration and data cleaning. To address the problem above, this paper proposes two methods: a field-independent method based on weighted grading for entity resolution, and a collective entity resolution method using a Quasi-Clique similarity measure. The new ideas in these methods are as follows:

1. Following the idea of grouping, a key field, or some words of that field, is chosen to divide the large data set into many non-intersecting small data sets; approximately duplicated records are then detected and eliminated within each small set, and these steps are repeated with other key fields or words (see the first sketch after this abstract). Experiments show that this algorithm achieves both good detection precision and good time efficiency.

2. In many domains, some underlying entities have strong ties to certain other entities. For instance, people often interact with their close friends in a social network, while in the bibliography domain, researchers with closely related interests form relatively stable communities in which they collaborate frequently. The compactness of such a community can be expressed by a kind of graph: the Quasi-Clique.

3. This paper proposes a collective entity resolution method that combines three measures: attribute-based similarity, context-based similarity, and Quasi-Clique similarity (see the second sketch below). In particular, relationship similarity is measured using Quasi-Cliques, which effectively reduces false positives and improves the accuracy of entity resolution. Experimental evaluation on data sets shows that the method achieves high precision and good efficiency.
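The abstract does not give the grouping algorithm of point 1 in code, but the multi-pass blocking idea can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the thesis's exact weighted grading scheme: the function names, the first-token blocking key, the field weights, and the 0.85 match threshold are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Normalized string similarity between two field values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities (a stand-in for
    the thesis's weighted grading of fields)."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def blocked_dedup(records, key_fields, weights, threshold=0.85):
    """Multi-pass blocking: one pass per key field, comparing records
    only inside each small block, then repeating with the next key field."""
    duplicates = set()
    for key in key_fields:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            # Block on the first token of the key field (one simple choice
            # of "some words of the field").
            tokens = rec.get(key, "").split()
            blocks[tokens[0].lower() if tokens else ""].append(idx)
        for ids in blocks.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    if record_similarity(records[ids[i]], records[ids[j]], weights) >= threshold:
                        duplicates.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return duplicates
```

Because each pass compares records only within a block, the quadratic pairwise comparison is confined to small sets, which is where the time-efficiency claim in point 1 comes from.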
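Similarly, the combined measure in point 3 might look like the following sketch, assuming an undirected reference graph (built with networkx), a caller-supplied attribute-similarity function, and illustrative weights. The γ-quasi-clique density used here is one plausible reading of the Quasi-Clique measure, not the thesis's exact definition.

```python
import networkx as nx  # assumed graph library; any adjacency structure would do


def quasi_clique_density(G, nodes, gamma=0.6):
    """Return 1.0 if the induced subgraph is a gamma-quasi-clique
    (every vertex adjacent to at least gamma * (n - 1) of the others),
    otherwise its edge density as a softer score."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    if n < 2:
        return 0.0
    min_deg = min(d for _, d in sub.degree())
    if min_deg >= gamma * (n - 1):
        return 1.0
    return sub.number_of_edges() / (n * (n - 1) / 2)


def combined_similarity(G, u, v, attr_sim, w_attr=0.5, w_ctx=0.3, w_qc=0.2):
    """Weighted combination of attribute-based, context-based
    (shared-neighbour Jaccard), and Quasi-Clique similarity for a
    candidate reference pair (u, v)."""
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    context = len(nu & nv) / len(nu | nv) if (nu | nv) else 0.0
    qc = quasi_clique_density(G, (nu | nv) | {u, v})
    return w_attr * attr_sim(u, v) + w_ctx * context + w_qc * qc
```

In a collective setting, pairs whose combined score exceeds a threshold would be merged and the graph updated, so that each merge can strengthen or weaken the evidence for the remaining candidate pairs.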
Keywords/Search Tags: data integration, data warehouse, entity resolution, clustering, Quasi-Clique