Research On The Query Method Of Set Similarity Based On Length Partition

Posted on:2020-07-28

Degree:Master

Type:Thesis

Country:China

Candidate:J T Hu

GTID:2438330596497571

Subject:Software engineering

Abstract/Summary:

As an efficient means of data representation, sets are used in many fields, for example to model users' music-listening preferences, the products on shopping websites, and genetic sequences in bioinformatics. In recent years, with the rapid development of e-commerce, information retrieval, and bioinformatics, the scale and complexity of set data have grown, and fast processing of massive, complex set similarity data has become a hot research topic. In set similarity queries, a candidate set is often so much longer or shorter than the query set that the pair cannot possibly satisfy the given similarity threshold, yet computing the similarity of such pairs still consumes a lot of time. To solve this problem, this thesis first proposes a set similarity query method based on length partitioning, which combines length partitioning with the similarity query algorithm ScanCount. Through data preprocessing, length partitioning, and an efficient index structure, LenSegII (Length Segmented Invert Indexes), records that cannot reach the similarity threshold are filtered out quickly, which improves the efficiency of the algorithm. In addition, a more streamlined count array is designed to reduce space overhead. Experiments on multiple data sets show that the method achieves higher time and space efficiency.

Most current set similarity query algorithms scan inverted lists serially or in parallel on the CPU, so their efficiency and throughput are relatively low and they adapt poorly to similarity search over large-scale sets. As the volume of set data grows, an efficient set similarity query algorithm is needed. To this end, this thesis designs LPSM (Length Segmented Signature Matrix), a GPU-based, parallel, length-partitioned similarity index structure. It uses a streamlined feature array to reduce space overhead and applies length partitioning when computing similarity from the signature matrix, so that records unlikely to satisfy the similarity threshold are filtered out quickly, which improves the efficiency of the algorithm.
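
The length-filtering idea described above can be illustrated with a minimal sketch. It is not the thesis's LenSegII implementation: the abstract does not name the similarity measure, so Jaccard similarity is assumed here, and the function names (`length_bounds`, `build_index`, `query`) are hypothetical. Under Jaccard with threshold t, a record can only match a query of size n if its size lies between t·n and n/t, so inverted lists for other lengths are never scanned. Within each surviving length partition, overlaps are counted in the style of ScanCount.

```python
import math
from collections import defaultdict

EPS = 1e-9  # guards against floating-point rounding at the bounds


def length_bounds(query_len, threshold):
    """Sizes a record may have to reach `threshold` (assumed Jaccard)."""
    lo = math.ceil(threshold * query_len - EPS)
    hi = math.floor(query_len / threshold + EPS)
    return lo, hi


def build_index(records):
    """Group inverted lists by record length: length -> token -> record ids."""
    index = defaultdict(lambda: defaultdict(list))
    for rid, rec in enumerate(records):
        tokens = set(rec)
        for tok in tokens:
            index[len(tokens)][tok].append(rid)
    return index


def query(index, q, threshold):
    """Return (record id, Jaccard similarity) pairs with similarity >= threshold."""
    q = set(q)
    lo, hi = length_bounds(len(q), threshold)
    results = []
    for length, partition in index.items():
        if not lo <= length <= hi:
            continue  # whole partition skipped: no record in it can qualify
        counts = defaultdict(int)  # ScanCount-style overlap counters
        for tok in q:
            for rid in partition.get(tok, ()):
                counts[rid] += 1
        for rid, overlap in counts.items():
            sim = overlap / (len(q) + length - overlap)
            if sim >= threshold:
                results.append((rid, sim))
    return sorted(results)


records = [
    ["a", "b", "c"],
    ["a", "b", "c", "d"],
    ["a"],
    ["x", "y", "z"],
    ["a", "b", "c", "d", "e", "f", "g", "h"],
]
index = build_index(records)
print(query(index, ["a", "b", "c"], 0.5))  # [(0, 1.0), (1, 0.75)]
```

In this toy run the records of length 1 and length 8 are discarded by the length bounds alone, before any inverted list is read. Allocating the counters per partition rather than for the whole collection loosely mirrors the smaller count array mentioned in the abstract, although the thesis's actual layout is not specified here.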


Related items

1	Research And Implementation Of Spatial Text Similarity Search
2	Similarity Graph-based Scientific Literature Search Key Technology Research
3	Research On Similarity Search In Information Network
4	Distributed High-Dimensional Similarity Search with Music Information Retrieval Applications
5	Study On Match Similarity Search
6	Study On Similarity Search For Textual And Spatial Data
7	Research On Similarity Search Based On Hash Function
8	Research On Locality Sensitive Hashing-Based Similarity Search
9	Dynamic Similarity Search Over Encrypted Data
10	A category-based similarity algorithm for semantic similarity in information sharing