
Entity Resolution Technology Research Based On Multi-Source Data

Posted on: 2018-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
Full Text: PDF
GTID: 2428330545968807
Subject: Computer technology
Abstract/Summary:
With the rapid development of information technology, information interaction has produced huge amounts of data, much of it redundant, which lowers data quality. Entity resolution (ER) plays an important role in data quality management (DQM). The same real-world entity may appear in different databases under different descriptions; for example, one user may register two different accounts on Taobao and Jingdong. The goal of entity resolution is to identify the records from multiple data sources that refer to the same real-world entity. Its results have a wide range of applications in data management, e-commerce information search, and other fields. Current entity resolution methods mainly address the problem of finding the same entity across two different data sources, and their accuracy and time efficiency still leave room for improvement. This paper therefore combines multi-attribute character data with heterogeneous network data, and studies an entity resolution algorithm for multi-attribute data and an entity resolution algorithm based on heterogeneous networks. It focuses on the following:

(1) Entity resolution for multi-attribute data. Starting from the traditional prefix-tree-based model, this paper analyzes the model's limitation in depth: it generates redundant data records. An adaptive greedy prefix tree algorithm is therefore proposed, which reduces the number of candidates and the matching time between data records. In a comparison with traditional entity resolution methods on the DBLP data set, the algorithm shows better efficiency and effectiveness on multi-attribute character data.

(2) Entity resolution for heterogeneous network data. This paper analyzes in depth the limitation of the multi-network anchoring (MNA) algorithm, which ignores the network topology characteristics of nodes. An entity resolution algorithm for heterogeneous networks based on meta-paths, which incorporates network topology information, is therefore proposed. Its effectiveness is validated by experiments on real Twitter and Foursquare data sets.

(3) To further improve the efficiency of entity resolution in heterogeneous networks, the ERHN++ algorithm is proposed. It uses the adaptive greedy prefix tree strategy to handle the similarity calculation over the attribute information of heterogeneous network nodes. First, the adaptive greedy prefix tree strategy prunes the node pairs to be matched across the different heterogeneous networks, generating a set of candidates that may match. Then the ERHN algorithm is applied to this candidate set to determine the anchor nodes. A comparison of the ERHN and ERHN++ algorithms on the Twitter and Foursquare data sets validates the ERHN++ algorithm and shows that efficiency is improved.
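The two-stage pattern described above (prune to a candidate set, then verify only the candidates) can be illustrated with generic prefix filtering for a Jaccard similarity join. This is a minimal sketch of that standard technique, not the adaptive greedy prefix tree algorithm or ERHN++ from the thesis; the function names, the whitespace tokenization, and the 0.6 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, threshold):
    """Prefix filtering: keep only pairs that share a token in their prefixes.

    Tokens are ordered rarest-first. Two sets with Jaccard >= threshold
    must share at least one token among the first
    len(x) - ceil(threshold * len(x)) + 1 tokens of each set.
    """
    token_sets = [set(r.lower().split()) for r in records]
    freq = Counter(tok for s in token_sets for tok in s)
    index = defaultdict(list)  # prefix token -> ids of records containing it
    for rid, toks in enumerate(token_sets):
        ordered = sorted(toks, key=lambda t: (freq[t], t))
        # Small epsilon guards against float round-up shortening the prefix.
        need = math.ceil(threshold * len(ordered) - 1e-9)
        for tok in ordered[: len(ordered) - need + 1]:
            index[tok].append(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))  # ids are ascending, so i < j
    return pairs, token_sets


def resolve(records, threshold=0.6):
    """Verify candidates with exact Jaccard; return matching index pairs."""
    pairs, token_sets = candidate_pairs(records, threshold)
    return sorted(
        (i, j)
        for i, j in pairs
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical records, invented for this example.
    records = [
        "entity resolution on multi source data",
        "entity resolution on multi source data sets",
        "heterogeneous network anchor links",
    ]
    print(resolve(records))
```

Only pairs that survive the prefix filter reach the exact similarity computation, which is where the saving in candidates and matching time comes from in this family of methods.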
Keywords/Search Tags:entity resolution, similarity join, heterogeneous network, meta-path