Entity Matching Across Multiple Heterogeneous Open Data Sources

Posted on: 2018-07-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Kong
GTID: 1318330512985358
Subject: Software engineering
Abstract/Summary:
With the promotion of the Internet+ plan, the degree of informatization in all walks of life keeps increasing. Internet platforms have come to act as social sensors that are used to understand individual behaviors. By integrating the individual behaviors collected from these social sensors, we can analyze and predict individual behaviors, preferences, and habits more completely, and so help alleviate the contradiction between supply and demand in China's macro economy. However, Internet data is decentralized, heterogeneous, fragmented, and of low quality. An effective splicing technology is therefore needed to fuse, integrate, and analyze the fragmented, low-quality data so that Internet platforms can better play their role as social sensors. This is the motivation of this dissertation.

Entity matching is a key problem in many fields, such as data management, information retrieval, and machine learning. The earliest research can be traced back to the 1940s. After more than half a century of development, entity matching techniques are widely used in areas such as data integration, knowledge acquisition, and user profiling. However, because data is fragmented in the Web 2.0 era, entity matching has become a very challenging task, and it has remained a hot research topic in both academia and industry in recent years.

Targeting the decentralized, heterogeneous, fragmented, low-quality, and interrelated open data of the Internet environment, this dissertation proposes a node matching algorithm based on social structure, an entity matching algorithm across multiple data sources, and a semi-supervised learning method for user matching across heterogeneous social networks. The contributions are as follows:

1. A node matching algorithm based on social network structures. To protect user privacy, we study node matching in social networks based only on social structure. The ANUM algorithm is proposed to tackle this problem for the massive, low-quality, and interrelated nodes of social networks. It uses a small number of annotated matched users to partition users into blocks, which reduces the number of candidate matching users. By extending the Fellegi-Sunter method, the proposed algorithm can handle social network similarities that follow continuous distributions. A generative probabilistic model is proposed and solved with the EM algorithm. The missing value problem is solved at the same time, because missing values are handled while the EM algorithm learns the parameters. Experiments on real social network datasets illustrate the efficiency and effectiveness of ANUM.

2. An entity matching algorithm for multiple heterogeneous data sources. Most previous work has considered the matching problem between only two data sources, and entity matching across multiple data sources remains to be studied further. The EMAD algorithm is proposed to address entity matching for multiple, massive, and heterogeneous data sources. To reduce the number of candidate pairs, locality sensitive hashing is used to partition entities from different data sources into buckets. In the proposed algorithm, entity matching across multiple data sources is converted into entity matching between two data sources. The exponential family of distributions is adopted to fit the distributions of the heterogeneous attributes of entities. The EM algorithm is used to learn the parameters of the probabilistic model and guarantees the convergence of EMAD. Experiments on three real-life datasets demonstrate that EMAD runs efficiently.

3. A semi-supervised user matching algorithm across heterogeneous social networks. Ground truth helps to improve the accuracy of user matching. However, because of privacy issues or unbalanced data, the amount of ground truth is rarely large enough to train the model. In this dissertation, the semi-supervised user matching problem is defined across heterogeneous social networks, based on massive, heterogeneous, low-quality, and interrelated social network data. The CSUI algorithm is introduced to match users in a semi-supervised manner. It uses a two-phase blocking scheme to limit the number of candidate user pairs. The first phase uses locality sensitive hashing to divide users into blocks. In each iteration, users are further partitioned, which dramatically reduces the candidate size. A method for evaluating similarity across social networks based on partially matched users is also proposed. The exponential family of distributions is adopted to integrate heterogeneous user attributes into a generative probabilistic model. The EM algorithm is used to learn the parameters of the probabilistic model and guarantees the convergence of CSUI. The performance of CSUI is verified by experiments on real-life social network datasets.

4. A prototype system for social network user matching and querying based on entity matching. With full consideration of massive, heterogeneous, low-quality, and interrelated Internet data, the SumQ prototype is designed and implemented. The two main components of SumQ, the user matching component and the visualization interface, are introduced in detail. The demonstration illustrates the four stages of the user matching process: candidate generation, similarity update, parameter learning, and rating prediction. The visualization interface displays the matching results and helps users match entities manually, which supports the evaluation of the proposed algorithms. The demo system shows that our solution is an integrated, effective scheme.

In summary, with full consideration of massive, heterogeneous, low-quality, and interrelated Internet data, this dissertation addresses node matching based on social structure, entity matching across heterogeneous data sources, and matching of social network users. Finally, the SumQ prototype system is designed and implemented. Theoretical analysis and experimental results illustrate that the proposed algorithms can handle the massive, heterogeneous, low-quality, and interrelated Internet data of the Web 2.0 era and solve the entity matching problem across multiple open data sources.
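
The abstract names locality sensitive hashing as the blocking step that cuts down candidate pairs in contributions 2 and 3, but it does not say which hash family or parameters are used. The following is a minimal illustrative sketch, assuming MinHash over character 3-grams with banding; the function names (`shingles`, `minhash_signature`, `lsh_candidate_pairs`) and all parameter values are hypothetical and are not taken from the dissertation's EMAD or CSUI implementations.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted by an integer seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_candidate_pairs(records, num_hashes=32, bands=16):
    """Group records whose signatures agree on at least one band.

    `records` maps a record id to a string (assumed: a concatenation of
    the entity's attributes). Only records that share a bucket become
    candidate pairs, so most dissimilar pairs are never compared.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)

    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical records from two sources, prefixed "a" and "b".
    recs = {"a1": "Alice Johnson", "b1": "Alice Johnson", "b2": "Zhang Wei"}
    print(lsh_candidate_pairs(recs))
```

In this sketch, records with identical text always land in the same bucket, while records with no shared 3-grams almost never do. A real pipeline would additionally keep only pairs whose members come from different sources, and would pass the surviving pairs to the probabilistic matching model.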
Keywords/Search Tags: heterogeneous data, entity matching, user matching, probabilistic model, exponential family