Research On Organization Structures Of Memory Data For Massive Storage Systems

Posted on:2020-10-20

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y Y Sun

Full Text:PDF

GTID:1368330590458917

Subject:Computer Science and Technology

Abstract/Summary:

PDF Full Text Request

The rapid growth of data poses significant challenges to the performance,capacity,and scalability of storage systems.High-throughput and low-latency query services for massive data consume substantial resources in storage systems.However,the memory data organization structures in existing massive storage systems cannot efficiently support query services.Hash-based data structures have the salient features of constant-scale addressing complexity and fast query responses,especially in terms of time and accuracy,which become a feasible solution to optimize the performance of storage systems.However,since hash functions have the feature of producing hash collisions,it's necessary to fully consider and solve the problems caused by hash collisions in the architecture design and system implementation.For the high-latency and low-efficiency issues due to handling hash collisions in the hash table construction process of existing methods,the cuckoo hashing provides alternative locations for items to mitigate hash collisions and improve index structure construction performance.For the property of tolerating a small number of errors of query results in read-intensive applications(web search,neural network training,etc.),the locality sensitive hashing(LSH)is often used to support approximate nearest neighbor(ANN)queries because of its data locality and dimensionality reduction.However,the two hash algorithms need to handle some system performance bottlenecks,especially the high latency due to hash collisions in the index construction of the cuckoo hashing,and expensive storage overhead for query accuracy in LSH.They become challenging for improving performance and the hot topics in the research of memory data organization structures in massive storage systems.First,this thesis investigates the problem of predicting endless loops in the process of constructing memory data organization structures.The cuckoo hashing utilizes kick-out operations to handle hash collisions,but there is a risk that the kick-out path is too long to form an endless loop.When the cuckoo hashing uses two hash functions,there exist only two kick-out paths for an item insertion.By pre-tracking the relationship of items,the final state of the two paths can be known without performing actual kick-out operations.A cuckoo hashing scheme called SmartCuckoo is proposed to predict the endless loops based on the observations.SmartCuckoo converts the cuckoo hash table into a pseudoforest,and each bucket in the table corresponds to a node in the pseudoforest,while each stored item in the table is an edge in the pseudoforest.The key idea behind SmartCuckoo is to represent the hash relationship between items as a directed pseudoforest and use it to track the item positions to predict the occurrence of endless loops during item insertions.SmartCuckoo avoids the costs of step-by-step detections,and thus saves time overhead and system resources.Experimental results show that SmartCuckoo improves insertion throughput by 25%-75%,and mixed operation throughput by more than 50% compared with traditional schemes.Second,this thesis proposes a concurrent solution to the problem of low efficiency of single-threaded memory data organization structures.For a read,the cuckoo hashing delivers real-time access with O(1)lookup complexity in the worst case,reflecting the fast-read strength.For a write,it may be necessary to perform recursive kick-out operations to kick the item out of its current position and then insert into its candidate,until it finds an empty bucket or determines that the insertion failed,thus resulting in slow-write performance bottleneck.A concurrent cuckoo hash scheme called CoCuckoo is proposed to reduce the write costs.In the pseudoforest,insertion operations on the path-overlapped nodes access the same subgraph,and insertion operations accessing different subgraphs are simultaneously executed without collisions.Therefore,a graph-grained locking mechanism is used to alleviate the lock contentions of threads.Only one thread accesses the shared path at a time to support concurrent read and write operations and improve performance.Experimental results show that CoCuckoo significantly improves 50%-130% throughput compared with the libcuckoo scheme.Third,this thesis investigates the method of mitigating hash collisions for memory data organization structures with multiple hash functions when endless loops cannot be effectively predicted.The frequency of hash collisions in each hash bucket is unbalanced during the hash table construction,namely,some buckets often have hash collisions and become�hot�,and others occur infrequently,which are �cold�.A collision frequency-aware cuckoo hash scheme called MinCounter is proposed to mitigate hash collisions.MinCounter assigns a counter to each bucket in the hash table to record the number of collisions in real-time manner.When a kick-out operation is needed due to a hash collision during the item insertion,MinCounter chooses the �cold� candidate bucket(with minimum counter)to form a kick-out path that is �unbusy�(less collisions),thus mitigating hash collisions and data migration.Experimental results show that MinCounter reduces 20%-55% hash collisions,more than 35%time overhead while increasing 5% table utilization compared with the conventional cuckoo hash schemes.Finally,this thesis investigates the memory data organization structures to support ANN queries for read-intensive applications.The function parameters are randomly set in the traditional locality sensitive hashing(LSH),and a large number of hash tables are needed to ensure the query accuracy,thus resulting in expensive storage and computational overheads.A distribution-aware LSH scheme called DLSH is proposed to solve the problem.The core idea of DLSH is to use the principal component of the data distribution obtained by Principal Component Analysis(PCA)as the projection vectors of hash functions in LSH scheme,further quantitate the weights of functions,and adjust the interval parameters to improve the query accuracy.Meanwhile,the query result set is refined according to the hit frequency of the candidate results to reduce the computational overhead.Experimental evaluations show that DLSH improves 20%-50% query accuracy,25%-60% F-Measure value,while significantly reducing 98% space overhead and 50%-85% time overhead.

Keywords/Search Tags:

Storage Systems, Memory Data Organization Structure, Hash Collision, Approximate Nearest Neighbor Query, Concurrency

PDF Full Text Request

Related items

1	LSH-based Similarity Search Of Web Data
2	Efficient Algorithms For Approximate Aggregation And Nearest Neighbor Queries Over Multi-Dimensional Data
3	Research On K Nearest Neighbor In High Dimension Data
4	Hash-based Approximate Nearest Neighbor Search For High-dimensional Data
5	Research Of Local Sensitive Hash Index Based On Nearest Neighbor Graph
6	Locality Sensitive Hashing Index Based On Neighborhood Collision Counting
7	Research On K-nearest Neighbor Query Technology On High-dimensional Data
8	Multiple Hash Tables Indexing And Optimization For Approximate Nearest Neighbor Search
9	Research Of The Efficient Index Structure For Approximate Nearest Neighbor Algorithm
10	Research On Dimensionality Reduction And Quantification Methods Of Approximate Nearest Neighbor Query For Streaming Data