
Research On Parallel Computing Based On Spark

Posted on: 2020-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: F Liu  Full Text: PDF
GTID: 2428330575985677  Subject: Information and Communication Engineering
Abstract/Summary:
The rapid development of Internet and information technology has made the exchange of information more convenient, while at the same time causing explosive growth in the global volume of data. Big data must be classified and processed, and the MapReduce distributed parallel computing framework that grew out of this need is favored by major enterprises for its low entry threshold, convenience, and effective processing capability. However, the low value density of big data has gradually made enterprises realize that MapReduce cannot meet timeliness requirements. Against this background, Spark is widely adopted for its more efficient data processing capability, and research on improving its performance has become a hot topic.

This paper first studies and analyzes the mechanism of the Spark memory cache. By surveying a large body of relevant literature and combining it with experiments, it is verified that the LRU cache replacement algorithm used by Spark leaves room for improvement when memory is insufficient. Secondly, the study finds that the order in which a Spark job accesses RDDs is determined by the job structure, which indicates that the job structure can be optimized to improve Spark's performance. After an intensive study of Spark's job scheduling mechanism, an optimization scheme for the Spark job structure is proposed that concentrates RDD reuse during execution, raises the cache hit rate, and thereby improves performance.

The RDD is Spark's core data abstraction. This paper introduces the concept of weight and defines a cache value for each RDD. A weight model for RDDs is established from the key RDD attributes obtained during job structure analysis, and an optimized cache replacement strategy, OCR (Optimized Cache Replace), is proposed to replace LRU, so that more valuable data is kept in the cache when memory resources are insufficient and system performance is improved by raising the cache hit rate. Finally, a variety of algorithms are used as experimental workloads, both individually and in mixed form, with existing authoritative data sets and data sets produced by a data generator serving as experimental data. By adjusting the cluster memory size, the number of iterations, and other parameters, performance comparison experiments are conducted against Spark. The experimental results demonstrate the effectiveness of the proposed optimization scheme when the memory resources of a Spark cluster are insufficient.
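To make the cache-value idea concrete, the following is a minimal, self-contained sketch in Scala (Spark's implementation language) of a weight-based replacement policy. It is not the thesis's actual model: the block attributes (size, recomputation cost, number of future references) and the weight formula are illustrative assumptions. The point is only that eviction order is driven by an estimated value of each cached RDD block rather than by recency, as in LRU.

// Illustrative sketch of a weight-based cache replacement policy in the
// spirit of the OCR strategy described above. Attribute names and the
// weight formula are assumptions, not the thesis's actual model.
object WeightedCacheSketch {

  // Hypothetical per-RDD-block metadata gathered from job-structure analysis.
  case class BlockInfo(id: String, size: Long, computeCost: Double, futureRefs: Int)

  // Assumed weight: blocks that are expensive to recompute, referenced again
  // later in the job, and small are the most valuable to keep in memory.
  def weight(b: BlockInfo): Double =
    b.computeCost * b.futureRefs / b.size.toDouble

  // A toy memory cache with a fixed capacity in bytes. On overflow it evicts
  // the lowest-weight blocks first, instead of the least-recently-used ones.
  final class WeightedCache(capacity: Long) {
    private var used = 0L
    private val blocks = scala.collection.mutable.Map.empty[String, BlockInfo]

    def put(b: BlockInfo): Unit = {
      blocks(b.id) = b
      used += b.size
      while (used > capacity && blocks.nonEmpty) {
        val victim = blocks.values.minBy(weight) // evict least valuable block
        blocks.remove(victim.id)
        used -= victim.size
      }
    }

    def cachedIds: Set[String] = blocks.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    val cache = new WeightedCache(capacity = 100L)
    cache.put(BlockInfo("rdd_1_0", size = 60, computeCost = 5.0, futureRefs = 3))
    cache.put(BlockInfo("rdd_2_0", size = 50, computeCost = 1.0, futureRefs = 1))
    cache.put(BlockInfo("rdd_3_0", size = 40, computeCost = 8.0, futureRefs = 2))
    // The cheap, rarely reused block (rdd_2_0) is evicted, not the oldest one.
    println(cache.cachedIds)
  }
}

Under this toy policy, a cheap, rarely reused block is evicted before an expensive, frequently reused one, which is the behavior the OCR strategy aims for when cluster memory is scarce.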
Keywords/Search Tags:parallel computing, Spark, job structure, cache replace, RDD weight