Research On Key Technologies Of Entity Resolution

Posted on: 2016-10-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L L Li
GTID: 1108330479978715
Subject: Computer software and theory

Abstract/Summary:
Entity resolution (ER) is an important research area in data quality management (DQM). A real-world entity may appear in one or more databases under quite different descriptions. The goal of ER is to identify the records from multiple data sources that refer to the same real-world entity. The results of ER are widely used in other steps of data quality management, such as data cleaning and data quality evaluation. Because differing descriptions of the same entity arise in many application areas, ER has attracted much attention in the literature. Although existing methods perform ER effectively in many cases, they have the following limitations.

1. Entity resolution faces two problems, called “tautonymy” and “synonym”. Tautonymy means that different entities may share an identical name; synonym means that different names may correspond to the same entity. Current research focuses on only one of these problems and does not consider the general case in which both may occur.
2. Traditional ER approaches obtain their results by comparing the similarity of records. They assume that records referring to the same entity are more similar to each other than to other records, which is called the “compact set property”. This property may not hold, so traditional ER approaches cannot identify records correctly in some cases.
3. The similarity metrics used by current ER approaches consider neither the correlation between words in records nor the major contribution that certain words, those describing the important features of real-world entities, make to entity identification. As a result, ER approaches based on these metrics sometimes cannot achieve high performance.
4. Current studies of data quality evaluation cover only consistency, currency, completeness and accuracy. However, the result of ER exposes a further kind of data quality problem: duplicated data may have conflicting values in the same attributes. We call this problem “entity-description conflict”. As far as we know, approaches for evaluating entity-description conflict in duplicated data have not been studied.

Based on the above analysis, and against the background of information integration and internet search, this thesis aims to minimize time complexity and maximize the accuracy of ER results. It investigates a graph-based ER algorithm, a rule-based ER algorithm, an ER algorithm based on distance metrics, and a data quality evaluation algorithm based on ER results. The main contributions of this thesis are as follows:

(1) The problems of “tautonymy” and “synonym” are introduced. As far as we know, this is the first study to consider both problems together. A general entity identification framework, EIF, is presented. In this framework, the similarity relationships between records are modeled as a graph, and entities are identified by applying graph clustering algorithms. As an application of EIF, an author identification algorithm is proposed that uses author names and co-author information. Extensive experiments verify the effectiveness of the framework and show that the EIF-based author identification algorithm outperforms existing author identification approaches in both precision and recall.
(2) The syntax and semantics of rules for ER are designed, and the independence, consistency, completeness and validity of the rules are defined and analyzed. An efficient rule discovery algorithm and an efficient rule-based ER algorithm are proposed and analyzed. A rule maintenance method is proposed for the case in which entity information changes. Experiments on real data verify the effectiveness of these methods.
(3) Treating the words in each record as features of entities, two word-feature-based distance metrics and their learning algorithms are presented, for pairwise ER and groupwise ER respectively. In these algorithms, each record or record pair is transformed into a word-based feature vector, and the best distance metric is then learned by a learning algorithm. Extensive experiments on real data sets verify the effectiveness and efficiency of the proposed algorithms.
(4) A mathematical model of the entity-description conflict is defined based on the conflicts between attribute values within a cluster. The problem of computing the range of the entity-description conflict when the accuracy of the ER result is below 100% is formulated. To solve it, four primary operators are identified, and the problem is proved to be NP-hard. Four approximation algorithms with ratio-bound guarantees are provided for the four primary operators, and a framework based on these operators is proposed for computing the range of the entity-description conflict. Experiments on real-life and synthetic data verify the effectiveness and efficiency of the proposed algorithms.
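
The graph-based idea behind contribution (1) can be illustrated with a minimal sketch. It is not the EIF framework itself: the abstract does not specify the similarity function or the clustering algorithm, so this example assumes token-set Jaccard similarity and plain connected components (via union-find) as stand-ins. The function names, the threshold value, and the sample records are all hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for any record similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def resolve(records: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Group record indices into entities.

    Records are nodes; an edge joins two records whose similarity is at
    least `threshold`. Connected components of that graph are returned
    as clusters (a simple substitute for a real graph clustering step).
    """
    parent = list(range(len(records)))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(records)), 2):
        if jaccard(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical author records: name followed by topic words.
records = [
    "Wei Wang database systems",   # same person as the next record (synonym)
    "W. Wang database systems",
    "Wei Wang computer vision",    # same name, different person (tautonymy)
]
clusters = resolve(records, threshold=0.5)
print(clusters)
```

In this toy input the first two records are grouped despite the differing name forms, while the third stays separate even though it shares the full name with the first, because the context words pull the similarity below the threshold. Connected components merge transitively, so a real system would likely need a stricter clustering step.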
Keywords: entity resolution, data quality, quality evaluation, graph clustering, metric learning