
Research And Optimization Of Data Placement Method In Spark Partitioner

Posted on: 2021-03-28
Degree: Master
Type: Thesis
Country: China
Candidate: R Wu
Full Text: PDF
GTID: 2428330605473923
Subject: Agriculture
Abstract/Summary:
With the rapid development of society, the arrival of the big data era has profoundly changed how people live and work. Big data technology has spawned many new industries and permeated all walks of life, driving efficient social development and bringing convenience to daily life. How to process the resulting massive data quickly has therefore become a problem that cannot be ignored. According to Intel's forecast, the total volume of global data would reach 44 ZB in 2020, of which China would generate 8 ZB, about one fifth of the global total. With ever more data to process, fast and effective processing of massive data is urgently needed.

Spark, as a fast computing engine, has become a mainstream big data processing platform. Spark's efficiency stems on the one hand from in-memory computing and on the other from the parallelism provided by partitioning. However, when keys are heavily repeated, Spark's default hash partitioning leaves the amount of data in each partition uneven; in extreme cases a few partitions hold all of an RDD's data. Such partition skew causes problems such as uneven resource utilization in the big data cluster and inefficient job execution.

The main research content and work of this thesis focus on the following aspects:

(1) This thesis designs and implements three partitioners that optimize hash partitioning: random-number partitioning (a generic sketch of this idea appears below), random-number partitioning with secondary allocation, and three-adjacent partitioning. Experiments partitioning ordinary text-file input data show that job execution efficiency improves significantly compared with the default hash partitioning method.

(2) Comparisons under different degrees of data skew show that the three optimized partitioners resolve the partition skew that the default hash partitioner exhibits when a large number of keys repeat, distributing skewed data more evenly across partitions and thereby improving computing efficiency.

The results show that the three optimized partitioners provide effective solutions to the data skew problem and improve the operating efficiency of the system, which is instructive for improving the data placement scheme in Spark partitions.
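The abstract does not give implementation details of the three partitioners. As a generic illustration of the random-number (salting) idea behind the first of them, the following Scala sketch spreads a heavily repeated key across partitions by attaching a random salt before aggregation and stripping it afterwards; the object name, salt count, and sample data are assumptions made for the example, not taken from the thesis.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.util.Random

// Minimal sketch of key salting against partition skew (hypothetical
// example, not the thesis's exact partitioner implementation).
object SaltedAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("salted-agg").setMaster("local[*]"))
    val saltBuckets = 8 // assumed number of salt values per key

    // Skewed input: one key dominates, so the default HashPartitioner
    // would send almost all records to a single partition.
    val skewed = sc.parallelize(Seq.fill(100000)(("hot", 1)) :+ (("cold", 1)))

    // Stage 1: prefix each key with a random salt so identical keys are
    // hashed to different partitions, then pre-aggregate per salted key.
    val partial = skewed
      .map { case (k, v) => (s"${Random.nextInt(saltBuckets)}#$k", v) }
      .reduceByKey(new HashPartitioner(saltBuckets * 2), _ + _)

    // Stage 2: strip the salt and combine the partial sums; at most
    // saltBuckets records per key remain, so this stage is cheap.
    val totals = partial
      .map { case (saltedKey, v) => (saltedKey.split("#", 2)(1), v) }
      .reduceByKey(_ + _)

    totals.collect().foreach(println)
    sc.stop()
  }
}
```

The two-stage shape is the usual trade-off of salting: one extra shuffle in exchange for partitions of roughly equal size during the expensive first aggregation.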
Keywords/Search Tags: Spark, Partition, Hash partitioner, Partition skew