
Research On Entity Resolution For Heterogeneous Big Data Integration

Posted on: 2019-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: F L Zhang
Full Text: PDF
GTID: 2348330545481040
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the information age, a great deal of data has been generated and accumulated in all walks of life. People need not only to manage and operate on these data but, more importantly, to link heterogeneous data; the analysis that integration makes possible can be of great value. Entity resolution is one of the key technologies of big data integration and also its foundation. When data are large in volume, heterogeneous, and highly noisy, entity resolution typically employs schema-agnostic blocking techniques to reduce the number of records to be matched, and it must complete record matching quickly and efficiently. This dissertation focuses on entity resolution in big data integration and studies its two main parts: blocking and record matching.

First, traditional blocking techniques require an a priori known schema and therefore cannot be applied to big data integration. To address this problem, this dissertation proposes a token-based, schema-agnostic blocking technique that, at the cost of some redundant comparisons, performs blocking on large, heterogeneous, and highly noisy data. In addition, a new pruning scheme based on cumulative weight is proposed on top of Meta-blocking; it further reduces the redundant comparisons generated by blocking and thereby improves efficiency. We evaluate the performance of our algorithm through a thorough experimental study over five real-world data sets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.

Second, to improve the efficiency of record matching, this dissertation extends the traditional N-gram algorithm with the idea of locality-sensitive hashing and redefines the distance formula of the traditional locality-sensitive hashing algorithm on the basis of the Hamming distance metric, which addresses the inability of locality-sensitive hashing to handle short records. Together, these techniques both cope with noise in the big data environment and achieve fast record matching through locality-sensitive hashing. We compare our algorithm with existing techniques through a thorough experimental study over three data sets; the results show that it effectively improves the efficiency of record matching.
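
The blocking stage described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the abstract does not define the cumulative-weight pruning scheme, so the sketch substitutes a generic rule (weight each candidate pair by the number of blocks it shares and keep pairs at or above the mean weight). The function names `token_blocking`, `weighted_pairs`, and `prune`, as well as the sample records, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(records):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(rid)
    # A block holding a single record produces no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def weighted_pairs(blocks):
    """Weight each candidate pair by the number of blocks the two records share."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune(weights):
    """Assumed pruning rule (not the thesis's): keep pairs with weight >= mean."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical records with different schemas describing overlapping entities.
records = {
    "r1": {"name": "John Smith", "city": "Boston"},
    "r2": {"fullName": "Smith, John", "location": "Boston MA"},
    "r3": {"title": "Jane Doe", "addr": "Boston"},
}
blocks = token_blocking(records)
candidates = prune(weighted_pairs(blocks))
```

In this toy input, the very common token "boston" places all three records in one block, which is the kind of redundant comparison that weight-based pruning is meant to discard; only pairs sharing several tokens survive as candidates for record matching.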
Keywords/Search Tags:big data integration, entity resolution, blocking, record matching