Font Size: a A A

Research On Key Technologies Of Learned Indexes For Storage Systems

Posted on:2024-05-07Degree:DoctorType:Dissertation
Country:ChinaCandidate:P F LiFull Text:PDF
GTID:1528307319462814Subject:Computer system architecture
Abstract/Summary:PDF Full Text Request
A large amount of data have been generated due to the development and prevalence of the global mobile network and artificial intelligence in recent years.To provide efficient data storage and access services for users,the storage systems constantly optimize the index structures with hardware-friendly designs,which provide efficient performance with high throughput,low latency,efficient scalability,and low memory overheads.However,the traditional index structures deliver poor performance in various storage systems,due to the large sizes of the index structures and the inefficient memory jumpings for searching data.Instead,the promising learned indexes train a large number of models to learn the data distributions and leverage the learned models to calculate the positions of data,which significantly increases the search performance and decreases the memory overheads due to the efficient calculations and the small number of parameters.Therefore,the efficient learned indexes are deployed in the storage systems to improve the data storage and access performance.However,existing learned index schemes mainly focus on the designs of algorithms,which fail to work well within various storage systems due to the following challenges,including the low concurrent performance in multi-core memory systems,the poor scalability in disaggregated memory systems,and the inefficiency of identifying duplicate data in data backup systems.To address these challenges,this dissertation investigates the limitations of existing learned indexes and proposes efficient solutions for various storage systems.Existing learned indexes deliver low concurrent performance in multi-core memory systems,due to the severe data dependency and high retraining overheads.To address these problems,this dissertation proposes a fine-grained concurrent learned index scheme,called FINEdex.To ensure that the prediction errors of all models are smaller than the predefined threshold,FINEdex proposes to adaptively train independent models according to the data distributions.By constructing a flattened data structure under each model,FINEdex significantly reduces the thread collisions when multiple threads concurrently access the data due to the low dependency among data.When inserting a large amount of data or the data distribution changes,FINEdex proposes to concurrently retrain the models in two granularities to adjust to the new data distributions.The experimental results show that FINEdex improves 1.8x to 2.5x insertion performance on the dynamic workloads,compared with existing state-of-the-art tree-based and learned index schemes.Existing RDMA-based(Remote Direct Memory Access)index structures fail to efficiently scale to the disaggregated memory systems,due to the bottlenecks of the computing resources and the network bandwidth.To address these problems,this dissertation proposes a scalable RDMA-oriented learned index scheme,called ROLEX.By moving the retraining operation out of the critical path,ROLEX proposes a retrainingdecoupled learned index for efficient data storage.The compute nodes directly access and modify the remote data via one-sided RDMA operations according to the cached learned indexes,while the memory nodes asynchronously retrain the models to improve the model accuracy with consistency guarantees.The experimental results show that ROLEX improves the throughput by 1.3x to 2.8x over existing schemes on the disaggregated memory systems,as well as achieves microsecond-level latency.Existing learned indexes become inefficient to identify duplicate data in the data backup systems,due to the memory overflow and the poor physical locality of storing massive data.To address these problems,this dissertation proposes an efficient deduplication system based on the learned indexes,called Hi De Store.The experiments analyze the distribution of different chunks.Based on the data distributions,Hi De Store proposes to leverage the learned index-based fingerprint cache to classify the hot and cold chunks,where the chunks having a high probability to appear in the subsequent backup versions are identified as hot chunks,while the other chunks are cold ones.Hi De Store respectively stores the hot and cold chunks in active and archival containers to improve the physical locality,as well as updates the chained recipes for future data restoring.The experimental results show that Hi De Store respectively improves the data deduplication and restore performance by up to 1.4x and 1.6x than existing state-of-the-art deduplication schemes in the data backup systems.
Keywords/Search Tags:Learned Indexes, Data Storage Systems, Concurrent Read and Write Optimization, Scalability, Space Optimization
PDF Full Text Request
Related items