Font Size: a A A

Performance Optimization Of Log-structured Merge Tree In Database System

Posted on:2022-12-04Degree:DoctorType:Dissertation
Country:ChinaCandidate:X L HeFull Text:PDF
GTID:1488306773482764Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
The storage model is the basis for the stable running of the online system,providing efficient storage and retrieval functions for various critical application operations,and is an indispensable component of the database management system.With the rapid iteration of mobile Internet,cloud computing industry,Internet of Things and other industries,data storage is growing exponentially.This phenomenon poses a great challenge to the storage model in database systems.The storage model based on Log-Structured Merge-Tree(LSM-tree)adopts the Out-of-place Update and hierarchical storage management structure,which can aggregate small particles of random writes to process write requests into large particles of continuous reads and writes to make full use of the bandwidth of the persistent storage hardware.It has high throughput capacity and high space utilization and is widely used in the storage model of various database systems.As the business is constantly updated and iterated and the application load is diversified,the workload faced by database systems is gradually becoming more complex and diversified.As a writeoptimized data structure,the existing research work still has some shortcomings and room for further optimization when facing the complex and diversified load of database systems.First,the asynchronous update of the LSM-tree allows the database system to have more possibilities in handling update operations.The large amount of invalid old version data generated by update operations has a large impact on disk space utilization,read and write performance,and the existing garbage collection mechanism determines that the LSM-tree cannot immediately eliminate them from the system.On the other hand,database systems usually use incremental storage or full storage as the only data storage method for update operations,and it is worth studying how to select the appropriate update method for each update operation at a fine-grained level.Secondly,the frequent triggering of compaction operations in the LSM-tree will cause large fluctuations in the write performance of the system,while the existing optimization methods usually improve the write performance of the system at the expense of query performance.Therefore,improving the range query performance of the LSM-tree is the main optimization goal of the database system while ensuring stable write performance.Finally,with the explosive growth of data volume,the storage system architecture of a single machine gradually evolves to a distributed architecture.In the highly concurrent scenario of continuous batch data ingestion,the main problem faced by distributed database systems implemented based on LSM-trees is that each storage node bears different merge pressure,resulting in applications that cannot fully utilize the high throughput and high parallelism capabilities of distributed systems to complete data ingestion tasks.Although some existing load balancing algorithms can reduce the unbalanced merge load to some extent,these methods usually consider a single influence factor and are not applicable to various unbalanced scenarios.Therefore,the performance of the LSM-tree in database systems under various different types of workloads still deserves further research and exploration,both under standalone and distributed architectures.Based on the above three key issues,the main work and contributions of this paper are summarized as follows:(1)Aiming at the problem of garbage collection caused by the update operation in the LSM-tree and how to choose the update method for the update operation,this paper proposes a fast garbage collection mechanism and designs an adaptive update selection strategy.First,to reduce the survival time of a large amount of invalid old version data in the disk,this paper designs a fast garbage collection mechanism.This mechanism separates the two parts of garbage collection,identifying invalid old version data and deleting old version data,and provides flexibility for the system in controlling write amplification,space amplification,and other issues.And it can ensure that each logically deleted data can be permanently deleted after only one merge operation.On the other hand,an adaptive judgment mechanism is designed in this paper for whether to choose incremental storage or full storage for update operations.This mechanism can select the appropriate update method according to the characteristics of each update operation,thus achieving a good trade-off between write amplification and read amplification of the system.(2)To address the problem of mutual constraints on read and write performance in mixed load,this paper designs TargetedKV,a read and write balanced keyvalue database based on key-value separation technology.To maintain stable and efficient write performance without degrading the performance of range queries,this paper first adopts a key-value storage separation scheme at the SSTable level.This scheme supports the maintenance of only fully ordered key and partially ordered value for each layer of the disk component when the log structure merge tree reorganizes the data,thus achieving lower write amplification while meeting the latency requirements of range queries.In addition,this paper also divides the data of the disk component into multiple data partitions according to the primary key order and uses the localized feature of query load to guide the merge operation,thus enabling fast aggregation of data of a certain range in a few files and further improving the performance of range queries.(3)For the problem of unbalanced merge load in distributed database systems implemented based on LSM-trees,this paper proposes a machine learning based balancing framework LeaBalancer.Distributed database systems usually use range partitioning to achieve horizontal scaling function,and this paper considers various factors affecting the data merging efficiency of each range partition,and fuses these factors together to predict the time to complete data merging for each data partition by machine learning method.Based on the prediction results,this paper designs a single-node scheduling strategy and a scheduling strategy among multiple nodes,respectively.With the two-tier scheduling scheme,LeaBalancer can balance the merging load between nodes with as little data migration as possible.In summary,this paper thoroughly investigates the shortcomings of the research work on garbage collection,range query,and load balancing of the storage model implemented in the database system based on the LSM-tree,and proposes three optimization schemes based on the characteristics of different scenarios.The first two works mainly address the read/write performance balancing problem of key-value database under single-computer architecture and the garbage collection and update data storage problems caused by update operations.The last work focuses on the unbalanced merge load problem of the merge tree of log structure in distributed architecture,and gives the corresponding scheduling scheme.A large number of experiments based on different workloads verify the effectiveness of the optimization scheme proposed in this paper.
Keywords/Search Tags:Database System, LSM-tree, Update-intensive Workload, Garbage Collection, Mixed Workload, KV Separation, Load Balancing, AI4DB
PDF Full Text Request
Related items