
Structured Data Oriented Entity Resolution Methods

Posted on: 2015-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: C W Wang
Full Text: PDF
GTID: 2348330518470242
Subject: Computer application technology
Abstract/Summary:
Entity resolution (ER) is a technology that extracts, matches, and merges records that represent the same real-world entity in structured and unstructured data. It is a long-standing challenge in database management, information retrieval, machine learning, natural language processing, and statistics, and it has become especially pressing in the big data age. Different scholars refer to it by a variety of names, including record linkage, de-duplication, co-reference resolution, reference reconciliation, and object consolidation. Accurate and fast ER has major practical implications in a wide variety of commercial, scientific, and security domains. Poor ER, by contrast, causes a series of problems: incorrect entity resolution leaves duplicate data in place, distorting information and creating potential problems for subsequent data mining, decision support systems, and business intelligence.

ER oriented to unstructured data is always application specific. Because this paper aims to propose a general ER method, it focuses on structured data. Moreover, it treats the functions for comparing and merging records as black boxes. The key to the ER method is to reduce the number of calls to the comparing and merging functions as much as possible.

First, for datasets without data confidences, the method extracts M-Kernels by training, performs statistical learning on the M-Kernels, and partitions the dataset into blocks. The purpose of blocking is to separate records into different blocks so that comparison and merging are confined to records within the same block, which improves ER performance. Second, for datasets with data confidences, ER must consider the merging order and choose the best one, because different merging orders can produce different ER results. The paper proposes the concept of a confidence merging dictionary and uses it to reduce the number of calls to the merging function. In addition, the paper puts forward an ER framework for processing data confidences.

Finally, the experiments show the effectiveness of partitioning records, the performance improvement brought by the confidence merging dictionary, and the feasibility of the ER framework for processing data confidences.
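
The effect of blocking on the number of calls to a black-box comparing function can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the `match` and `merge` functions, and the `block_key` function are hypothetical, and the simple ZIP-code key merely stands in for the M-Kernel-based partitioning, whose details the abstract does not give. The compare-and-merge loop is a generic one, not necessarily the algorithm used in the thesis.

```python
from collections import defaultdict


def resolve(records, match, merge, block_key=None):
    """Generic compare-and-merge loop (illustrative, not the thesis's algorithm).

    `match` and `merge` are treated as black boxes. If `block_key` is given,
    records are compared only with records in the same block.
    Returns (resolved_records, number_of_match_calls).
    """
    blocks = defaultdict(list)
    for rec in records:
        key = block_key(rec) if block_key else None  # None: one global block
        blocks[key].append(rec)

    calls = 0
    resolved = []
    for block in blocks.values():
        pending = list(block)
        done = []  # records in this block that do not match each other
        while pending:
            rec = pending.pop()
            for i, other in enumerate(done):
                calls += 1
                if match(rec, other):
                    # The merged record may match further records,
                    # so it goes back into the queue.
                    done.pop(i)
                    pending.append(merge(rec, other))
                    break
            else:
                done.append(rec)
        resolved.extend(done)
    return resolved, calls


# Hypothetical black-box functions and toy data.
def match(a, b):
    return a["name"].lower() == b["name"].lower()


def merge(a, b):
    return {"name": a["name"], "zip": a["zip"], "phones": a["phones"] | b["phones"]}


records = [
    {"name": "Ann Lee", "zip": "10001", "phones": {"555-1"}},
    {"name": "ann lee", "zip": "10001", "phones": {"555-2"}},
    {"name": "Bob Ray", "zip": "20002", "phones": {"555-3"}},
    {"name": "Bob Ray", "zip": "20002", "phones": {"555-4"}},
]

unblocked, calls_unblocked = resolve(records, match, merge)
blocked, calls_blocked = resolve(records, match, merge, block_key=lambda r: r["zip"])
print(len(unblocked), calls_unblocked)
print(len(blocked), calls_blocked)
```

On this toy input both runs resolve the four records into the same two entities, but the blocked run calls `match` fewer times because records with different ZIP codes are never compared. The trade-off is that a poorly chosen key can place matching records in different blocks, where they are never compared at all.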
Keywords/Search Tags: M-Kernel, Blocking, CMD, BUBBLE