
Research And Optimization Of Data Placement Method In Spark Partitioner

Posted on: 2021-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: R Wu
Full Text: PDF
GTID: 2428330605473923
Subject: Agriculture
Abstract/Summary:
With the rapid development of society, the arrival of the big data era has profoundly changed how people live and work. Big data technology has spawned many new industries and permeated all walks of life, driving efficient social development and bringing convenience to daily life. How to process the resulting massive data quickly has therefore become a problem that cannot be ignored. According to Intel's forecast, the total volume of global data would reach 44 ZB in 2020, of which China would generate 8 ZB, about one fifth of the global total. With ever more data to process, fast and effective processing of massive data is urgently needed.

Spark, as a fast computing engine, has become a mainstream big data processing platform. Spark's efficiency stems on the one hand from in-memory computing and on the other from the parallelism provided by partitioning. However, when keys are heavily repeated, Spark's default hash partitioning leaves the amount of data in each partition uneven; in extreme cases a few partitions hold all of an RDD's data. Such partition skew causes problems such as uneven resource utilization in the big data cluster and inefficient job execution.

The main research content and work of this thesis focus on the following aspects:

(1) This thesis designs and implements three partitioners that optimize hash partitioning: random-number partitioning (a generic sketch of this idea appears below), random-number partitioning with secondary allocation, and three-adjacent partitioning. Experiments partitioning ordinary text-file input data show that job execution efficiency improves significantly compared with the default hash partitioning method.

(2) Comparisons under different degrees of data skew show that the three optimized partitioners resolve the partition skew that the default hash partitioner exhibits when a large number of keys repeat, distributing skewed data more evenly across partitions and thereby improving computing efficiency.

The results show that the three optimized partitioners provide effective solutions to the data skew problem and improve the operating efficiency of the system, which is instructive for improving the data placement scheme in Spark partitions.
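The abstract does not give implementation details of the three partitioners. As a generic illustration of the random-number (salting) idea behind the first of them, the following Scala sketch spreads a heavily repeated key across partitions by attaching a random salt before aggregation and stripping it afterwards; the object name, salt count, and sample data are assumptions made for the example, not taken from the thesis.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.util.Random

// Minimal sketch of key salting against partition skew (hypothetical
// example, not the thesis's exact partitioner implementation).
object SaltedAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-agg").setMaster("local[*]"))
    val saltBuckets = 8 // assumed number of salt values per key

    // Skewed input: one key dominates, so the default HashPartitioner
    // would send almost all records to a single partition.
    val skewed = sc.parallelize(Seq.fill(100000)(("hot", 1)) :+ (("cold", 1)))

    // Stage 1: prefix each key with a random salt so identical keys are
    // hashed to different partitions, then pre-aggregate per salted key.
    val partial = skewed
      .map { case (k, v) => (s"${Random.nextInt(saltBuckets)}#$k", v) }
      .reduceByKey(new HashPartitioner(saltBuckets * 2), _ + _)

    // Stage 2: strip the salt and combine the partial sums; at most
    // saltBuckets records per key remain, so this stage is cheap.
    val totals = partial
      .map { case (saltedKey, v) => (saltedKey.split("#", 2)(1), v) }
      .reduceByKey(_ + _)

    totals.collect().foreach(println)
    sc.stop()
  }
}
```

The two-stage shape is the usual trade-off of salting: one extra shuffle in exchange for partitions of roughly equal size during the expensive first aggregation.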
Keywords/Search Tags: Spark, Partition, Hash partitioner, Partition skew