
Entity Resolution With Deep Learning

Posted on: 2022-05-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Nie
Full Text: PDF
GTID: 2518306524480334
Subject: Computer Science and Technology
Abstract/Summary:
Data integration is a key task in the field of information retrieval. Entity resolution (ER), also known as entity matching or duplicate record detection, is a key step of data integration. ER aims to find data records that refer to the same real-world entity across data from different sources.

Early research was dedicated to devising various string-based distance functions. Such unsupervised approaches lack effectiveness and generality, since no single metric fits all datasets, and thresholds must be tuned manually for each dataset. With the availability of crowd workers, an alternative research branch leverages human intervention in the loop. However, such hybrid human-assisted approaches are not scalable due to financial budget constraints. In recent years, research has mainly focused on machine-learning-based algorithms. These approaches view ER as a binary classification task and apply traditional classifiers to hand-crafted features. They improve ER accuracy to a certain extent, but the dependency on manual feature engineering still hinders generality and robustness. Recently, with the popularity of deep learning, some works have improved the performance of ER by devising effective end-to-end deep learning models. Since existing models simply adopt vanilla RNNs to model sequential information, their architectures are rather simple. Previous studies failed to capture the saliency of words effectively, failed to identify the importance of different attributes for structured ER, and did not use the recently popular pre-trained language models, so there is still plenty of room for accuracy improvement.

In this paper, we propose a multi-context attention mechanism (MCA) to fully exploit the semantic context and capture highly discriminative terms. First, self-attention is used to learn dependencies between words in a sentence. Second, pair-attention analyzes both input sequences jointly while learning a similarity representation. Third, global-attention is used to assign high weight to discriminative terms. To support structured datasets with multiple attributes, we further propose attribute attention to distinguish important attributes. We conduct extensive experiments on 7 publicly accessible benchmark datasets. The experimental results clearly establish our superiority over previous studies. Besides, with the popularity of pre-trained language models (PLMs), we also apply pre-trained language models to ER. On 6 textual datasets, the model with a PLM is superior to MCA, further improving performance and generalization. Finally, based on the current research status, we discuss the challenges and opportunities for further research.
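To make the limitation of the early unsupervised approaches concrete, the following is a minimal sketch of threshold-based ER with a string-based similarity function. Token-level Jaccard similarity is used here purely for illustration; the record strings and the threshold value are hypothetical, not drawn from the thesis, and the manually chosen threshold is exactly what fails to generalize across datasets.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def match(record_a: str, record_b: str, threshold: float = 0.5) -> bool:
    """Declare a match when similarity exceeds a manually chosen threshold.

    The threshold must be re-tuned for every dataset, which is the
    generality problem described above.
    """
    return jaccard(record_a, record_b) >= threshold
```

For example, `match("Apple iPhone 12 64GB", "apple iphone 12 64gb black")` is a match under this threshold, while a threshold that works for product titles may be far too loose or too strict for, say, bibliographic records.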
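The attention mechanisms above all build on the same scaled dot-product primitive. The sketch below, in NumPy, shows that primitive, not the thesis's actual MCA model: the embedding dimension, sequence length, and random inputs are illustrative assumptions. Self-attention takes queries, keys, and values from one sequence; pair-attention would instead take queries from one input record and keys/values from the other.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # pairwise word-word affinities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # one sequence: 4 "words", 8-dim embeddings
out, w = attention(X, X, X)   # self-attention: Q, K, V from the same sequence
```

Pair-attention would be `attention(X_left, X_right, X_right)` for two record embeddings, and global-attention corresponds to weighting terms by a learned global query; both reuse this same computation.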
Keywords/Search Tags:Entity Resolution, Deep Learning, Attention Mechanism, Natural Language Processing