Font Size: a A A

Research And Implementation Of Scalable Learned Index Based On String Keys

Posted on:2023-06-14Degree:MasterType:Thesis
Country:ChinaCandidate:P Y JinFull Text:PDF
GTID:2558306914456354Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Under the background of big data era,efficient data retrieval has become an urgent need for database workers.Building index on data can improve the efficiency of retrieval.However,when the amount of data is too large,the traditional index structure occupies too much memory and cannot be fully loaded into the memory during retrieval,which seriously affects the efficiency of data query.Machine learning can explore the potential distribution of data,and the model itself has the characteristics of low storage cost and high query efficiency.Therefore,using machine learning models to replace traditional index structures provides new ideas for index optimization.The current learned index still has the following shortcomings:1)Building an existing index structure on string keys and querying them will result in high computational costs and reduce the query efficiency of the index.2)Non-primary keys do not have regular distribution and contain duplicate keys,so they cannot be directly applied to secondary index.In view of the above two problems,this thesis studies the learned index technology,and the specific work is as follows:1.Aiming at the problem of decrease retrieval efficiency caused by the increase of computational cost when building learned index directly on string keys,this thesis presents a new learned index for string keys called SLI.In this thesis,SLI index constructs a two-layer model structure from bottom to top by feature extraction algorithm based on shortest unique substring and data partitioning algorithm based on error threshold.At the same time,in order to alleviate the cost of root node retraining caused by model updating,this thesis further presents the KMSLI,which uses keymodel pair to replace the root node to achieve the purpose of rapid updating.The experimental results show that,compared with the traditional index structure,the memory usage of SLI is reduced by 79%,the query performance is increased by 1.76 ×,and the update efficiency is increased by nearly 1.3 ×.2.Aiming at the problem that non-primary keys are not distributed and contain duplicate keys,a two-stage secondary learned index structure is proposed in this thesis called SISLI.The first stage filters out duplicate non-primary keys and unique non-primary keys through a learned Bloom filter.In the second stage,learned index is constructed on two sets of nonprimary keys according to the filtering results to meet the demand of secondary index of the query.The experimental results show that the SISLI on memory footprint is better than traditional index structure.The experimental results show that the SISLI reduces the memory footprint by 32%compared to B-Tree.3.Based on the above-mentioned primary key learned index and secondary learned index,this thesis proposes the algorithm of point query and range query.Firstly,put the SLI,traditional index and the existing learned index ALEX verification on the URL and domain data sets.The experimental results show that on the two types of data sets,SLI has significantly increased in query performance.Secondly,on the taxis in NYC open source data set,the SISLI and B-tree indexes are analyzed.The experimental results show that the SISLI greatly enhance the performance of the query.Finally,this thesis verifies the correctness and validity of the algorithm.
Keywords/Search Tags:String keys, Machine learning model, Learned index, Secondary index
PDF Full Text Request
Related items