Research On High-performance Storage Strategy For Multi-source Heterogeneous Time Series Data In HBase

Posted on: 2020-10-31 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Zhang
GTID: 2428330623451213 | Subject: Computer technology
Abstract/Summary:
With the development of the Internet of Things, 5G, artificial intelligence, and related technologies, the demand for heterogeneous time-series log data from different sources is growing explosively in comprehensive information systems such as network security situational awareness, intelligent manufacturing, and smart cities. Traditional database systems cannot meet the unified storage requirements of such massive heterogeneous data. As a distributed column-family database, HBase can handle mass data storage in these scenarios thanks to its good scalability. However, the partitioned storage mechanism and the load balancing strategy currently used by HBase only ensure that each RegionServer hosts roughly the same number of regions. Data access requests are not evenly distributed across regions, which easily leads to load skew, severely degrades read and write performance, and prevents HBase from being applied effectively in these scenarios.

This thesis studies in depth the load skew caused by uneven data access requests in HBase, taking efficient storage of massive multi-source heterogeneous time-series data as the application scenario. To address the problem, it first designs a distributed storage strategy based on user access behavior prediction. Building on HBase's original partitioned storage, the strategy models user access behavior to predict whether data is hot or cold. Combined with a Rowkey assembly scheme that reflects the temporal and spatial characteristics of the data, it stores data in hot and cold partition tiers, which balances the data request load across the system and ultimately improves read performance. The thesis also improves the existing optimization scheme that builds a secondary index for hot data. By modifying the Rowkey assembly scheme of both the index and the master data, the system stores index data in the same Region as the corresponding master data, so that access requests for index data are load balanced as well.

Finally, combining the above strategies and algorithms, the thesis designs and implements a prototype distributed storage system for multi-source heterogeneous time-series data and gives the corresponding design and implementation details. A load data set is generated from the KDD CUP 99 network intrusion detection data set by simulating user access behavior, and an experimental platform is built to test the feasibility of the algorithm and the system. Experimental results show that, compared with existing single-table HBase and HBase with a pre-partitioning strategy, PUB-HBase effectively disperses data access requests among nodes while keeping the data uniformly distributed. This shortens users' query time for hot data and reduces the extra I/O cost caused by data load skew.
Keywords/Search Tags:Data Storage, Hot and Cold Stratification, Load Balancing, Multi-Source Heterogeneity, Big Data, Machine Learning
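
The abstract does not give the actual Rowkey layout, so the following is only a minimal sketch of the general idea it describes: a bucket prefix that spreads hot rows across regions, and an index Rowkey that reuses the master row's prefix so both land in the same key range. The field order, the `h`/`c`/`i` tier markers, `NUM_BUCKETS`, and all function names are hypothetical, and the hot/cold flag is assumed to come from a separate access-behavior predictor that is not modeled here.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical: one bucket per pre-split region


def _bucket(text: str) -> int:
    """Map a string to a stable bucket number (md5 is used only for spreading)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def data_rowkey(source: str, ts: int, hot: bool) -> str:
    """Build a master-data rowkey: bucket | tier | source | timestamp."""
    # Assumption: hot rows are salted by source and time so that requests
    # spread over all buckets; cold rows are salted by source only so that
    # their scans stay contiguous.
    bucket = _bucket(f"{source}:{ts}") if hot else _bucket(source)
    tier = "h" if hot else "c"
    return f"{bucket:02d}|{tier}|{source}|{ts:013d}"


def index_rowkey(data_key: str, field: str, value: str) -> str:
    """Build a secondary-index rowkey that reuses the master row's bucket prefix."""
    bucket = data_key.split("|", 1)[0]
    return f"{bucket}|i|{field}={value}|{data_key}"
```

Under these assumptions, a table pre-split on the two-digit bucket prefix would keep each index row in the same region as the master row it points to, while requests for hot rows are distributed over all buckets.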