
Design And Implementation Of Data Partition Algorithm Based On SparkSQL

Posted on: 2021-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: W W Huang
GTID: 2428330611499985
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of modern science and technology and the spread of the Internet, the volume of data on the Internet and the Internet of Things is growing explosively, and a traditional single-node DBMS can no longer meet the resulting storage and computing requirements. In this context, distributed systems and frameworks such as Hadoop and Spark have emerged. These frameworks offer good computing performance and environmental adaptability, and they use distributed clusters to store and process massive data, which has led to their wide adoption. However, both have shortcomings in their data allocation strategies. Building on a single-table predicate partition algorithm, this paper therefore presents a multi-table predicate partition algorithm. First, we briefly analyze Spark's operating principles and its two data allocation strategies. We then study a data partition algorithm for single-table predicate relations, the predicate-based reference partition schema. On this basis, we propose a multi-table predicate partition algorithm, the multi-table predicate-based partition schema. To represent these two partition schemas conveniently, we introduce a partition schema graph for each of them. Based on the two partition schema graphs, we design and implement data loading and partition algorithms for each schema, and we explain how each schema can optimize equi-joins in data queries. For the multi-table schema, we also propose a greedy algorithm that generates the sequence of partition schemes, which simplifies experimental testing. Finally, through a series of comparative experiments, we verify the effectiveness of the two partition schemas presented in this paper, and we compare and analyze the two partition strategies on query time, loading time, space efficiency, and other performance indicators. The experimental results show that our algorithm improves query efficiency on the distributed database. We close by summarizing the paper, analyzing its shortcomings, and putting forward some ideas for further research.
Keywords/Search Tags: massive data, Spark, data allocation
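
The abstract does not give the thesis's algorithms, so the following is only a minimal plain-Python sketch of the general idea behind reference partitioning by a join predicate, not the author's implementation. It assumes a seed table hash-partitioned on an integer key, and a second table whose rows are placed in (and if necessary duplicated into) every partition that holds a join partner, so that an equi-join can run partition by partition without moving data. All function names, the table contents, and the round-robin placement of rows without a partner are hypothetical choices made for illustration.

```python
def hash_partition(rows, key, n):
    """Seed table: place each row in partition (row[key] % n). Assumes integer keys."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def reference_partition(rows, key, ref_parts, ref_key):
    """Place each row in every partition holding a row with ref[ref_key] == row[key].

    Rows with no join partner are spread round-robin (an assumption of this sketch).
    """
    n = len(ref_parts)
    # Map each referenced key value to the set of partitions where it occurs.
    where = {}
    for i, part in enumerate(ref_parts):
        for ref in part:
            where.setdefault(ref[ref_key], set()).add(i)
    parts = [[] for _ in range(n)]
    orphans = 0
    for row in rows:
        targets = where.get(row[key])
        if targets is None:
            targets = {orphans % n}
            orphans += 1
        for i in sorted(targets):
            parts[i].append(row)  # duplicated when partners sit in several partitions
    return parts


def local_join(left_parts, right_parts, left_key, right_key):
    """Equi-join each pair of co-located partitions independently, then concatenate."""
    out = []
    for lp, rp in zip(left_parts, right_parts):
        index = {}
        for r in rp:
            index.setdefault(r[right_key], []).append(r)
        for l in lp:
            for r in index.get(l[left_key], []):
                out.append({**l, **r})
    return out


# Hypothetical sample data.
orders = [
    {"order_id": 1, "cust_id": 10},
    {"order_id": 2, "cust_id": 10},
    {"order_id": 3, "cust_id": 20},
    {"order_id": 4, "cust_id": 30},
]
customers = [
    {"cust_id": 10, "name": "Ann"},
    {"cust_id": 20, "name": "Bob"},
    {"cust_id": 40, "name": "Eve"},
]

order_parts = hash_partition(orders, "order_id", 2)
customer_parts = reference_partition(customers, "cust_id", order_parts, "cust_id")
joined = local_join(order_parts, customer_parts, "cust_id", "cust_id")
```

In this sketch the customer with orders in both partitions is stored twice, which trades space for a join that needs no shuffle. That is the same kind of trade-off between query time, loading time, and space efficiency that the abstract says the experiments measure.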