
Research And Application Of MemSql Data Partitioning Strategy For Spark

Posted on: 2019-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: L W Meng
Full Text: PDF
GTID: 2428330590975435
Subject: Software engineering
Abstract/Summary:
The growth of big data has driven the development of distributed storage frameworks, and frameworks such as HDFS, HBASE, and MemSql have emerged one after another. However, when these storage frameworks are combined with Spark, unreasonable data partitioning often leads to unbalanced cluster load, which slows the response of data analysis. A partition is the smallest unit of storage in MemSql, and the data within it is stored in a hash table. The Spark-MemSql framework has two application scenarios: an independent framework and an integrated framework.

In the independent framework, data is read from other locations rather than locally. Using the default primary key as the partition key, or selecting the partition key at random, may cause a large number of hash conflicts or leave the index without effect, which slows Spark batch queries against MemSql. This thesis therefore proposes KSAWA, a partition key selection strategy that speeds up Spark's access to MemSql. In the integrated framework, data is read and analyzed locally, and the same number of partitions is distributed to each node. Because node resources are heterogeneous, this can reduce both the utilization of system resources and the parallelism of the analysis application. For this reason, the thesis also proposes a dynamic data partitioning mechanism and strategy based on node load, which improves system load balance and application response speed.

The main contributions of the thesis are as follows:

(1) For the independent framework scenario, a partition key selection strategy, KSAWA, is proposed. Starting from multiple candidate keys, the strategy takes into account factors such as data skewness, the weights of different query modes, and query frequency, and selects the partition key according to the needs of the users of the query application. This reduces the skew caused by hash conflicts and makes full use of the index to speed up Spark's access to MemSql.

(2) For the integrated framework scenario, a dynamic data partitioning mechanism and strategy based on node load are proposed. The mechanism comprises load monitoring and acquisition, load forecasting, data pre-partitioning, and data migration. The strategy uses the quadratic smoothing method to predict node load and combines the AHP and entropy index weighting methods. A partitioning strategy suited to each data analysis application can thus be obtained, dynamically adjusting the load balance of the system and improving the response speed of the application.

(3) Based on the aforementioned research results, a prototype of the Spark-MemSql data analysis platform was designed and implemented. Applications developed on different data sets were used to test and verify the prototype system. The results show that the partitioning strategies for the two application scenarios improve system load balance and application response speed.
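
The load-prediction and weighting steps named in the abstract can be illustrated with a minimal sketch. It assumes that "quadratic smoothing" refers to Brown's double (second-order) exponential smoothing and that the entropy index weights follow the standard entropy weight method; the thesis itself may define either step differently. The function names, the smoothing factor `alpha`, and the sample load values are hypothetical, and the AHP weighting step is not shown.

```python
import math


def brown_forecast(series, alpha=0.5, horizon=1):
    """Forecast a load value with Brown's double exponential smoothing.

    Assumption: this stands in for the "quadratic smoothing method";
    alpha must lie strictly between 0 and 1.
    """
    if not series:
        raise ValueError("series must be non-empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    s1 = s2 = series[0]  # initialise both smoothed series at the first sample
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first-order smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # second-order smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + horizon * trend


def entropy_weights(matrix):
    """Objective indicator weights via the entropy weight method.

    matrix[i][j] is the non-negative value of indicator j on node i;
    at least two nodes are required.
    """
    n, m = len(matrix), len(matrix[0])
    if n < 2:
        raise ValueError("need at least two nodes")
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for v in column:
            if v > 0:
                p = v / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # normalise entropy to [0, 1]
        divergences.append(max(0.0, 1.0 - entropy))
    d_sum = sum(divergences)
    if d_sum <= 1e-12:
        # every indicator is identical across nodes: fall back to equal weights
        return [1.0 / m] * m
    return [d / d_sum for d in divergences]


# Hypothetical CPU-utilisation samples for one node, oldest first.
cpu_history = [1.0, 2.0, 3.0, 4.0]
next_cpu = brown_forecast(cpu_history, alpha=0.5)

# Hypothetical per-node indicators: [memory load, CPU load].
node_loads = [[1.0, 1.0], [1.0, 3.0]]
weights = entropy_weights(node_loads)
```

In this sketch an indicator that is identical on every node receives almost no weight, because it carries no information for telling nodes apart. A partitioner could combine the forecasts with the weights into one score per node and assign fewer partitions to nodes with higher predicted load.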
Keywords/Search Tags: Spark, MemSql, partition key selection, load balancing, dynamic partitioning strategy