Research On The Query Method Of Set Similarity Based On Length Partition

Posted on: 2020-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: J T Hu
Full Text: PDF
GTID: 2438330596497571
Subject: Software engineering
Abstract/Summary:
As an efficient means of data representation, sets are used in many fields, for example to model users' music listening preferences, the products on shopping websites, and genetic sequences in bioinformatics engineering. In recent years, with the rapid development of e-commerce, information retrieval, and bioinformatics engineering, the scale and complexity of set data have increased, and fast processing of massive, complex set similarity data has become an active research topic. In set similarity query computation, a candidate set is often so much longer or shorter than the query set that the two cannot possibly satisfy the given similarity threshold, yet considerable time is still spent computing on such sets. To solve this problem, this thesis first proposes a set similarity query method based on length partitioning, which combines length partitioning with the similarity query algorithm ScanCount. Through data preprocessing, length partitioning, and an efficient index structure, LenSegII (Length Segmented Invert Indexes), records that cannot satisfy the similarity threshold are filtered out quickly, improving the efficiency of the algorithm. In addition, a more streamlined count array is designed to reduce space overhead. Experiments on multiple data sets show that the method achieves higher time and space efficiency.

Most current set similarity query algorithms scan inverted lists serially or in parallel on the CPU, so their efficiency and throughput are relatively low and they adapt poorly to similarity search over large-scale sets. As the volume of set data grows, an efficient set similarity query algorithm for massive set data is needed. To this end, this thesis designs LPSM (Length Segmented Signature Matrix), a GPU-based parallel, length-partitioned similarity index structure. It uses a streamlined feature array to reduce space overhead and applies length partitioning when computing similarity with the signature matrix, so that records unlikely to satisfy the similarity threshold can be filtered out quickly, improving algorithm efficiency.
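The idea of combining a length filter with ScanCount can be illustrated with a minimal Python sketch. It is not the thesis's LenSegII implementation, whose layout the abstract does not specify. The sketch assumes Jaccard similarity as the measure, and the function names and the nested-dictionary index layout are hypothetical. It relies on the standard bound that J(q, s) >= t implies t·|q| <= |s| <= |q|/t, so only partitions whose length falls in that range are scanned.

```python
import math
from collections import defaultdict


def build_length_partitioned_index(records):
    """Build a hypothetical index: set length -> token -> list of record ids."""
    index = defaultdict(lambda: defaultdict(list))
    for rid, rec in enumerate(records):
        tokens = set(rec)  # deduplicate so that len() is the set length
        for token in tokens:
            index[len(tokens)][token].append(rid)
    return index


def jaccard_search(index, query, threshold):
    """Return sorted ids of records with Jaccard(query, record) >= threshold.

    Assumes 0 < threshold <= 1.
    """
    q = set(query)
    if not q:
        return []
    eps = 1e-9  # guards the bounds against floating-point rounding
    # Length filter: J(q, s) >= t implies t*|q| <= |s| <= |q|/t.
    lo = max(1, math.ceil(threshold * len(q) - eps))
    hi = math.floor(len(q) / threshold + eps)
    results = []
    for length in range(lo, hi + 1):
        partition = index.get(length)
        if partition is None:
            continue  # no records of this length
        # ScanCount: scan the inverted list of each query token and
        # count how many query tokens every record contains.
        counts = defaultdict(int)
        for token in q:
            for rid in partition.get(token, ()):
                counts[rid] += 1
        # Verify candidates: |q ∩ s| / |q ∪ s| with |q ∪ s| = |q| + |s| - overlap.
        for rid, overlap in counts.items():
            if overlap / (len(q) + length - overlap) >= threshold - eps:
                results.append(rid)
    return sorted(results)


records = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a"},
    {"x", "y", "z"},
    {"a", "b", "c", "d", "e", "f", "g", "h"},
]
index = build_length_partitioned_index(records)
print(jaccard_search(index, {"a", "b", "c"}, 0.6))  # prints [0, 1]
```

In this example the records of length 1 and length 8 are never scanned, because their lengths fall outside the range [2, 5] implied by the threshold 0.6 and a query of length 3.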
Keywords/Search Tags: Set Similarity, Set Similarity Search, GPU, Information Retrieval