
Load Balanced Data Loading And Fault Recovery For Log-structured Stores

Posted on: 2021-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: G H Ding
Full Text: PDF
GTID: 2428330620468201
Subject: Software engineering
Abstract/Summary:
In an era of data flooding, applications ranging from e-commerce to social networks and other mobile Internet services generate data on an unprecedented scale. The traditional way to address database scalability, adding servers and sharding the database, incurs substantial manual maintenance cost and hardware overhead. To reduce this overhead and the problems caused by sharding, the industry usually replaces the original system with a new database system. Among these, database systems based on the log-structured merge tree (LSM-tree), such as OceanBase, are widely used, and the data blocks stored on disk in such systems are generally globally ordered. When a traditional database is replaced, massive data must be loaded into the new system, and a node or a loading process may fail during loading. To reduce the total loading time and the recovery time, we propose a data loading method that is load balanced and supports efficient recovery.

To support load-balanced loading, we pre-calculate the number of partitions from the storage block size of the target system and the load file. To avoid the high overhead of determining split points through global sampling and through random or head sample selection, we select a subset of blocks as sampling blocks and sample them at equal intervals to determine the split points between partitions. To handle different types of failures and speed up recovery, we exploit the multiple replicas kept by LSM-based systems to reduce the amount of data retrieved from the remote data source during recovery, and we propose replica-based partial failure recovery, which avoids recovering from a failure by restarting and reloading everything.

The main contributions are summarized as follows:

1. We propose a load-balanced data loading method that pre-calculates the number of partitions and determines the split points between partitions based on partial sampling. In this thesis, the data stored in the log-structured storage system is divided into multiple fixed-size sub-tables that are distributed across multiple storage nodes in a globally ordered way. Pre-calculating the number of partitions and performing partial sampling with samples selected at equal intervals reduce the sampling overhead while keeping the partitions balanced, so that data format conversion and data migration to the target storage system are load balanced.

2. We propose a loading method with replica-based partial failure recovery to reduce the time needed for failure recovery. In a distributed environment, to let the data loading process handle faults automatically and to reduce recovery time, this thesis proposes a local fault recovery method that exploits the multiple replicas of a log-structured storage system. It shortens recovery because fewer replicas have to be re-pulled from the data source when a failure occurs.

3. Experiments based on Hadoop and the open-source database CEDAR verify the efficiency of the load-balanced data loading and recovery methods proposed in this thesis. Comparing pre-determined and pre-calculated partition counts shows that the pre-calculated partition method proposed for LSM-based storage systems is more efficient. Comparing global sampling with partial sampling shows that the proposed partial sampling balances sampling cost and precision. Comparing the three sample selection methods shows that interval sample selection is better suited to locally ordered data sets. For recovery from node and loading-process failures, comparing restart-based global failure recovery with replica-based local failure recovery shows that replica-based partial failure recovery shortens recovery time because fewer replicas have to be retrieved from the data source.

In summary, this thesis studies the problem of data loading in log-structured storage systems. First, for load balancing while loading massive data, it proposes a data partitioning scheme designed around the structure of the storage system and the local order of the load files, which achieves overall load balancing of data loading. Second, for fault recovery during data loading, it proposes a loading method built on replica-based local fault recovery, which uses the multiple replicas of the system to reduce recovery time. Finally, experiments verify the efficiency of the proposed methods.
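The partitioning idea described in the abstract (pre-calculate the partition count, sample only some blocks, take samples at equal intervals, then derive split points) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the integer keys, and in particular the rule "partition count = load file size divided by storage block size, rounded up", since the abstract only says the count is based on those two inputs and does not give the thesis's actual formula.

```python
import bisect
from typing import List, Sequence


def partition_count(load_file_bytes: int, block_bytes: int) -> int:
    """Assumed rule: one partition per target storage block, rounded up."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    return max(1, (load_file_bytes + block_bytes - 1) // block_bytes)


def interval_sample(blocks: Sequence[Sequence[int]],
                    block_step: int, key_step: int) -> List[int]:
    """Partial sampling: visit every block_step-th block and keep
    every key_step-th key inside each visited block."""
    samples: List[int] = []
    for block in blocks[::block_step]:
        samples.extend(block[::key_step])
    return sorted(samples)


def split_points(samples: Sequence[int], num_partitions: int) -> List[int]:
    """Choose num_partitions - 1 split keys at equal ranks of the sorted samples."""
    if not samples or num_partitions < 2:
        return []
    n = len(samples)
    return [samples[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_of(key: int, splits: Sequence[int]) -> int:
    """Index of the partition whose key range contains `key`."""
    return bisect.bisect_right(splits, key)


if __name__ == "__main__":
    # Hypothetical load file: 8 locally ordered blocks of 100 keys, 8 bytes per key.
    blocks = [list(range(b * 100, (b + 1) * 100)) for b in range(8)]
    parts = partition_count(load_file_bytes=800 * 8, block_bytes=1600)
    splits = split_points(interval_sample(blocks, block_step=2, key_step=10), parts)
    sizes = [0] * parts
    for block in blocks:
        for key in block:
            sizes[partition_of(key, splits)] += 1
    print(parts, splits, sizes)
```

In this toy input only 40 of the 800 keys are read to find the split points, yet the resulting partitions are equal in size. Real load files would be less regular, so the balance obtained depends on how well the sampled blocks represent the key distribution.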
Keywords/Search Tags: data loading, load balance, fault recovery, log-structured