
Research On Entity Resolution Method Based On Multi-attribute Attention Mechanism

Posted on: 2020-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: C Yuan
GTID: 2428330578954827
Subject: Computer technology
Abstract/Summary:
In a big data environment, massive data resources are generated by multiple data platforms. Multi-source data fusion integrates entity information from these sources to provide high-quality data sets for data mining and machine learning tasks. These data sets may contain a large number of duplicate entities, which wastes resources and distorts the results of data analysis. Entity resolution is a key technology for improving data quality because it addresses this duplication problem. In the real world, the same entity may appear on several data platforms, and each platform may describe it differently, for example with different data formats and expressions. The task of entity resolution is to find duplicate entities and clean the data to improve its quality.

At present, entity resolution research focuses mainly on duplicate record detection. Most existing methods are based on feature matching: similarity features between entity pairs are extracted by hand, and a matching function is designed to judge whether an entity pair matches. On the one hand, the existing similarity features rely on the literal similarity of characters or text and ignore semantic information. On the other hand, the role of key attributes is ignored, so the different contributions of different attributes to a matching task cannot be distinguished, which hurts both the quality and the efficiency of entity resolution. To solve these problems, this thesis proposes an entity resolution method based on a multi-attribute attention mechanism. The main research contents are as follows:

(1) We propose an entity matching model based on a multi-attribute attention mechanism. To extract the semantic similarity between entity pairs, we use the BERT pre-trained model and fine-tune it on the table data to obtain a high-dimensional semantic vector for each character. At the same time, to highlight the different contribution of each attribute to entity matching, we split each tuple in the table into a word sequence and use a two-layer LSTM to model the whole tuple. After splitting the tuple by attribute, we add an attention mechanism on top of each attribute to highlight its individual contribution.

(2) We propose an attribute-based weighted hash blocking method to improve the efficiency of entity resolution. We use the semantic relationships between attributes and tuples to calculate the weight of each attribute in the semantic expression of the tuple. After encoding each attribute with locality-sensitive hashing (LSH), we combine the semantic expression and the weight of each attribute to produce a weighted hash code for the tuple.

We conducted experiments on multiple public datasets. The experiments show that the proposed scheme effectively improves the quality and efficiency of entity resolution and is better suited to entity resolution tasks on big data.
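The weighted hash blocking idea in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes random-hyperplane (sign) hashing as the locality-sensitive hash, assumes the per-attribute codes are combined by a weighted vote, and uses hand-picked toy vectors and weights in place of BERT-derived semantic vectors and learned attribute weights. All function names (`make_hyperplanes`, `attribute_code`, `tuple_code`, `build_blocks`) are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per output bit (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def attribute_code(vector, hyperplanes):
    """Hash one attribute vector to a list of +1/-1 signs, one per hyperplane."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else -1
        for h in hyperplanes
    ]


def tuple_code(attr_vectors, attr_weights, hyperplanes):
    """Combine attribute codes into one bit string by a weighted vote per bit."""
    codes = [attribute_code(v, hyperplanes) for v in attr_vectors]
    bits = []
    for pos in range(len(hyperplanes)):
        vote = sum(w * code[pos] for w, code in zip(attr_weights, codes))
        bits.append("1" if vote >= 0 else "0")
    return "".join(bits)


def build_blocks(records, attr_weights, hyperplanes):
    """Group record ids whose tuples receive the same weighted hash code."""
    blocks = defaultdict(list)
    for record_id, attr_vectors in records.items():
        key = tuple_code(attr_vectors, attr_weights, hyperplanes)
        blocks[key].append(record_id)
    return dict(blocks)


# Toy data: each record has two attributes, each a 3-dimensional vector.
records = {
    "a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "b": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],    # same vectors as "a"
    "c": [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]],  # opposite direction to "a"
}
weights = [0.7, 0.3]  # assumed attribute weights; the first attribute dominates
planes = make_hyperplanes(num_bits=8, dim=3, seed=42)

for key, ids in build_blocks(records, weights, planes).items():
    print(key, ids)
```

Only records that fall into the same block would then be passed to the matching model, which is how blocking reduces the number of pairwise comparisons. In this toy setup, records "a" and "b" share a block because their vectors are identical, while "c" lands in a different one.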
Keywords/Search Tags:entity resolution, multi-attribute attention mechanism, local sensitive hash, deep learning