DV-Join: A Novel Method Based On Dynamic Indexing And VMC-Filtering For Similarity Join

Posted on:2019-12-11

Degree:Master

Type:Thesis

Country:China

Candidate:R C Ruan

GTID:2428330545997764

Subject:Computer technology

Abstract/Summary:

String similarity join is a basic operation in data management. Given a collection of string datasets, it finds all pairs of similar strings in them. With the development of Internet information systems and artificial intelligence systems, string similarity join has become an essential operation in many applications, such as web pages, data integration, and bioinformatics, and it has been widely studied in recent years. In addition, more and more Internet companies use string similarity join as a foundation for their future artificial intelligence development.

However, the big data era has made massive datasets increasingly common, and the existing string similarity join methods are not efficient on very large datasets. The main limitations are as follows. First, existing string similarity join algorithms all store the data on disk and load it into memory when computing similarity. They build a large number of inverted indexes in memory, which can easily exceed the memory capacity of a single compute node. Second, the candidate pairs produced by these methods contain many dissimilar strings, so the subsequent verification step takes more time. Third, the existing algorithms run on a single machine node, whose memory is often limited and not easy to expand, so a single node struggles to cope with massive datasets.

To address these problems, we propose a novel distributed computing method based on dynamic indexing and VMC-Filtering, called DV-Join. First, DV-Join introduces dynamic indexing, which adjusts the inverted indexes dynamically while similarity is being computed and can greatly reduce their memory consumption. Second, the VMC-Filtering algorithm is added to the original filtering mechanism to filter out more dissimilar candidate pairs and save verification time. Third, DV-Join uses Spark, an open-source cluster computing framework, to process massive data in parallel across a cluster, greatly reducing the computing time. Experimental results show that DV-Join is more efficient than other existing methods on various large datasets.
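The abstract refers to inverted indexes, candidate pairs, and verification without showing how they fit together. The sketch below illustrates the generic filter-and-verify pattern on a single machine, assuming whitespace tokens, Jaccard similarity, and a threshold greater than zero. It is not DV-Join: dynamic indexing, VMC-Filtering, and the Spark parallelization are not reproduced here, and the names `jaccard` and `similarity_join` are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_join(strings, threshold):
    """Return (i, j, similarity) with i < j for pairs with Jaccard >= threshold.

    Assumes threshold > 0, so any qualifying pair shares at least one token.
    """
    token_sets = [frozenset(s.lower().split()) for s in strings]
    index = defaultdict(list)  # inverted index: token -> ids of earlier strings
    results = []
    for j, tokens in enumerate(token_sets):
        # Filter step: only earlier strings sharing a token become candidates.
        candidates = set()
        for tok in tokens:
            candidates.update(index[tok])
        # Verify step: compute the exact similarity for each candidate pair.
        for i in sorted(candidates):
            sim = jaccard(token_sets[i], tokens)
            if sim >= threshold:
                results.append((i, j, sim))
        # Index the current string after probing, so each pair is checked once.
        for tok in tokens:
            index[tok].append(j)
    return results


if __name__ == "__main__":
    data = [
        "data integration system",
        "data integration systems",
        "web data integration system",
        "protein sequence",
    ]
    for i, j, sim in similarity_join(data, 0.5):
        print(i, j, sim)
```

In this sketch the third string becomes a candidate for both earlier strings because it shares tokens with them, but only one of those pairs survives verification. Dissimilar candidates of this kind are the verification cost that the abstract says additional filtering is meant to reduce.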

Keywords/Search Tags:

Dynamic Indexing, VMC-Filtering, Distributed Computing

Related items

1	Research On The Distributed Indexing Platform And Information Filter In Distributed Full-text Retrieval System
2	Distributed indexing and aggregation techniques for peer-to-peer and grid computing
3	Research On Rapid Filtering And Cluster Indexing Method In 3D CAD Model Retrieval
4	NON-PDC Dynamic Output Feedback Control And Distributed H_∞ Filtering Of T-S Fuzzy System
5	Scalable Solution Of Collaborative Filtering Algorithm Based On Dimension And Distributed Computing
6	Research On Improved Distributed Collaborative Filtering Recommendation Algorithm
7	Research And Implementation Of Dynamic Distributed Computing System For Small And Medium-sized Computer Cluster
8	The Research On Distributed Collaborative Filtering Algorithm
9	Research On Distributed Multilevel Indexing Model For Decentralized Service Repositories
10	The Research Of Distributed Indexing Scheme For Large-scale Semantic Data Based On Linked Data