
DV-Join: A Novel Method Based on Dynamic Indexing and VMC-Filtering for Similarity Join

Posted on: 2019-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: R C Ruan
Full Text: PDF
GTID: 2428330545997764
Subject: Computer technology
Abstract/Summary:
String similarity join is a basic operation in data management: given a collection of string datasets, it finds all pairs of similar strings in them. With the development of Internet information systems and artificial intelligence systems, string similarity join has become an essential operation in many applications, such as web pages, data integration, and bioinformatics, and it has been widely studied in recent years. In addition, more and more Internet companies use string similarity join as a foundation for future artificial intelligence development.

However, the advent of the big data era has made massive datasets increasingly common, and the existing string similarity join methods are not efficient on very large datasets. The main limitations are as follows. First, existing algorithms store the data on disk and load it into memory when computing similarity. They build a large number of inverted indexes in memory, which can easily exceed the memory capacity of a single compute node. Second, the candidate pairs these methods produce contain many dissimilar strings, so the subsequent verification step takes more time. Third, existing algorithms run on a single machine node, whose memory is limited and hard to expand, so a single node struggles to cope with massive datasets.

To address these problems, we propose a novel distributed computing method based on dynamic indexing and VMC-Filtering, called DV-Join. First, DV-Join introduces dynamic indexing, which adjusts the inverted index dynamically while similarity is being computed and can greatly reduce the memory consumed by the inverted index. Second, the VMC-Filtering algorithm is added to the original filtering mechanism; it removes more dissimilar candidate pairs and so reduces the time spent verifying candidates. Third, DV-Join uses Spark, an open-source distributed cluster computing framework, to execute the computation over massive data in parallel on a cluster, greatly reducing the computing time. Experimental results show that DV-Join is more efficient than other existing methods on various large datasets.
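The filter-and-verify pipeline that the abstract refers to (inverted index, candidate pairs, verification) can be illustrated with a minimal single-machine sketch. This is not DV-Join: the abstract does not describe how dynamic indexing or VMC-Filtering work, so the example below assumes a generic prefix-filtering join over whitespace tokens with Jaccard similarity, and the function names `jaccard` and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_join(strings, threshold):
    """Return sorted (i, j, sim) triples, i < j, with Jaccard >= threshold.

    Assumption: strings are tokenized on whitespace; q-grams would also work.
    """
    records = [frozenset(s.lower().split()) for s in strings]

    # Global token order: rarest first, so prefixes hold selective tokens.
    freq = defaultdict(int)
    for rec in records:
        for tok in rec:
            freq[tok] += 1

    index = defaultdict(list)  # token -> ids of records already processed
    results = []
    for rid, rec in enumerate(records):
        tokens = sorted(rec, key=lambda t: (freq[t], t))
        # Prefix filter: two sets with Jaccard >= threshold must share a token
        # within their first |x| - ceil(threshold * |x|) + 1 ordered tokens.
        # The small epsilon guards against floating-point round-up.
        prefix_len = len(tokens) - ceil(threshold * len(tokens) - 1e-9) + 1
        prefix = tokens[:prefix_len]

        # Filter step: collect candidates from the inverted index.
        candidates = set()
        for tok in prefix:
            candidates.update(index.get(tok, ()))

        # Verify step: compute the exact similarity for each candidate.
        for cid in candidates:
            sim = jaccard(rec, records[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))

        # Index only the prefix tokens, and only after probing, so the index
        # grows incrementally as records are processed.
        for tok in prefix:
            index[tok].append(rid)
    return sorted(results)


if __name__ == "__main__":
    data = [
        "data integration system",
        "data integration systems",
        "web data integration system",
        "bioinformatics sequence search",
    ]
    for i, j, sim in similarity_join(data, 0.7):
        print(i, j, round(sim, 2))
```

Indexing only prefix tokens keeps the inverted index smaller than indexing every token, and every candidate that survives the filter still requires an exact verification. Those are the two costs (index memory and verification time) that the abstract says DV-Join targets.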
Keywords/Search Tags: Dynamic Indexing, VMC-Filtering, Distributed Computing