Design And Implementation Of Data Partition Algorithm Based On SparkSQL

Posted on:2021-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:W W Huang

GTID:2428330611499985

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of modern science and technology and the spread of the Internet, data volumes on the Internet and the Internet of Things are growing explosively, so a traditional single-node DBMS can no longer meet the requirements of data storage and computation. In this context, distributed systems and frameworks such as Hadoop and Spark have emerged. These distributed computing frameworks offer good computing performance and environmental adaptability and use distributed clusters to store and process massive data, which has led to their wide adoption. However, both lack an adequate data allocation strategy, so, building on a single-table predicate partition algorithm, this paper presents a multi-table predicate partition algorithm. First, the paper briefly reviews and analyzes the operating principle of Spark and its two data allocation strategies. We then study a data partition algorithm for single-table predicate relations, the predicate-based reference partition schema. On the basis of it, we propose a multi-table predicate partition algorithm, the multi-table predicate-based partition schema. To present these two partition schemas conveniently, we introduce a partition schema graph for each of them. Based on the two partition schema graphs, we design and implement the data loading and partition algorithms for each schema, and explain how each schema optimizes equi-joins in data queries. For the latter partition schema, we propose a greedy algorithm that generates the sequence of partition schemes, for convenience of experimental testing. Finally, through a series of comparative experiments, we verify the effectiveness of the two partition schemas presented in this paper and compare and analyze the query time, loading time, space efficiency, and other performance indicators of the two partition strategies. The experimental results show that our algorithm does improve query efficiency on a distributed database. In the end, we summarize the content of this paper, analyze its shortcomings, and put forward some ideas for further research.
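
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea behind predicate-based reference partitioning, not the thesis's implementation. It assumes one table is hash-partitioned on a key and a second table is then placed according to an equality join predicate that references the first, so that an equi-join can run partition by partition without moving data. All names (`hash_partition`, `reference_partition`, `local_join`, the `customers` and `orders` tables) are hypothetical, and plain Python lists stand in for Spark partitions.

```python
from collections import defaultdict

def hash_partition(rows, key, n):
    """Split rows into n partitions by hashing row[key] (integer keys assumed)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def reference_partition(rows, key, ref_parts, ref_key):
    """Copy each row into every partition that holds a referenced row
    satisfying the predicate row[key] == ref[ref_key].
    Assumption of this sketch: rows without a join partner are dropped,
    which is sufficient for an inner equi-join."""
    keys_per_part = [{r[ref_key] for r in part} for part in ref_parts]
    parts = [[] for _ in ref_parts]
    for row in rows:
        for i, keys in enumerate(keys_per_part):
            if row[key] in keys:
                parts[i].append(row)
    return parts

def local_join(left_parts, right_parts, left_key, right_key):
    """Equi-join co-located partitions pairwise; no row crosses partitions."""
    out = []
    for left, right in zip(left_parts, right_parts):
        lookup = defaultdict(list)
        for r in right:
            lookup[r[right_key]].append(r)
        for l in left:
            for r in lookup.get(l[left_key], []):
                out.append({**l, **r})
    return out

# Hypothetical toy tables.
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(6)]
orders = [{"order_id": 100 + i, "cust_id": i % 4} for i in range(10)]

customer_parts = hash_partition(customers, "cust_id", 3)
order_parts = reference_partition(orders, "cust_id", customer_parts, "cust_id")
joined = local_join(order_parts, customer_parts, "cust_id", "cust_id")
```

In this sketch a referencing row can be copied into more than one partition when its partners are spread across partitions, which trades storage for join locality.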

Keywords/Search Tags:

massive data, Spark, data allocation

Related items

1	Design And Implementation Of The Massive Data Computing Platform Based On Spark
2	A Frequent Serial Episode Mining Algorithm With Time Constraints Based On Spark Platform
3	Research On Resource Dynamic Allocation Technology On Spark Data Processing Framework
4	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
5	Research On The In-Memory Data Management Technology On Spark Data Processing Framework
6	Research On Dynamic Placement Of RDD Data For Interactive Spark Applications
7	Research On SPARK Based Massive Data Frequent Pattern Mining Algorithms
8	Research On Efficient Storage,Query And Cluster Analysis Of Massive Spatiotemporal Data
9	The Research Of Big Data Manipulating Technology Based On Spark
10	Research On Association Rules Algorithm For Massive Telecommunication Network Alarm Data Based On Spark