
Research On Parallel Computing Based On Spark

Posted on: 2020-06-07  Degree: Master  Type: Thesis
Country: China  Candidate: F Liu  Full Text: PDF
GTID: 2428330575985677  Subject: Information and Communication Engineering
Abstract/Summary:
The rapid development of Internet and information technology has made the exchange of information more convenient, while at the same time causing explosive growth in the global volume of data. Big data must be classified and processed, and the MapReduce distributed parallel computing framework that grew out of this need is favored by major enterprises for its low entry threshold, convenience, and effective processing capability. However, the low value density of big data has gradually made enterprises realize that MapReduce cannot meet timeliness requirements. Against this background, Spark is widely adopted for its more efficient data processing capability, and research on improving its performance has become a hot topic.

This paper first studies and analyzes the mechanism of the Spark memory cache. By surveying a large body of relevant literature and combining it with experiments, it is verified that the LRU cache replacement algorithm used by Spark leaves room for improvement when memory is insufficient. Secondly, the study finds that the order in which a Spark job accesses RDDs is determined by the job structure, which indicates that the job structure can be optimized to improve Spark's performance. After an intensive study of Spark's job scheduling mechanism, an optimization scheme for the Spark job structure is proposed that concentrates RDD reuse during execution, raises the cache hit rate, and thereby improves performance.

The RDD is Spark's core data abstraction. This paper introduces the concept of weight and defines a cache value for each RDD. A weight model for RDDs is established from the key RDD attributes obtained during job structure analysis, and an optimized cache replacement strategy, OCR (Optimized Cache Replace), is proposed to replace LRU, so that more valuable data is kept in the cache when memory resources are insufficient and system performance is improved by raising the cache hit rate. Finally, a variety of algorithms are used as experimental workloads, both individually and in mixed form, with existing authoritative data sets and data sets produced by a data generator serving as experimental data. By adjusting the cluster memory size, the number of iterations, and other parameters, performance comparison experiments are conducted against Spark. The experimental results demonstrate the effectiveness of the proposed optimization scheme when the memory resources of a Spark cluster are insufficient.
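To make the cache-value idea concrete, the following is a minimal, self-contained sketch in Scala (Spark's implementation language) of a weight-based replacement policy. It is not the thesis's actual model: the block attributes (size, recomputation cost, number of future references) and the weight formula are illustrative assumptions. The point is only that eviction order is driven by an estimated value of each cached RDD block rather than by recency, as in LRU.

// Illustrative sketch of a weight-based cache replacement policy in the
// spirit of the OCR strategy described above. Attribute names and the
// weight formula are assumptions, not the thesis's actual model.
object WeightedCacheSketch {

  // Hypothetical per-RDD-block metadata gathered from job-structure analysis.
  case class BlockInfo(id: String, size: Long, computeCost: Double, futureRefs: Int)

  // Assumed weight: blocks that are expensive to recompute, referenced again
  // later in the job, and small are the most valuable to keep in memory.
  def weight(b: BlockInfo): Double =
    b.computeCost * b.futureRefs / b.size.toDouble

  // A toy memory cache with a fixed capacity in bytes. On overflow it evicts
  // the lowest-weight blocks first, instead of the least-recently-used ones.
  final class WeightedCache(capacity: Long) {
    private var used = 0L
    private val blocks = scala.collection.mutable.Map.empty[String, BlockInfo]

    def put(b: BlockInfo): Unit = {
      blocks(b.id) = b
      used += b.size
      while (used > capacity && blocks.nonEmpty) {
        val victim = blocks.values.minBy(weight) // evict least valuable block
        blocks.remove(victim.id)
        used -= victim.size
      }
    }

    def cachedIds: Set[String] = blocks.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    val cache = new WeightedCache(capacity = 100L)
    cache.put(BlockInfo("rdd_1_0", size = 60, computeCost = 5.0, futureRefs = 3))
    cache.put(BlockInfo("rdd_2_0", size = 50, computeCost = 1.0, futureRefs = 1))
    cache.put(BlockInfo("rdd_3_0", size = 40, computeCost = 8.0, futureRefs = 2))
    // The cheap, rarely reused block (rdd_2_0) is evicted, not the oldest one.
    println(cache.cachedIds)
  }
}

Under this toy policy, a cheap, rarely reused block is evicted before an expensive, frequently reused one, which is the behavior the OCR strategy aims for when cluster memory is scarce.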
Keywords/Search Tags:parallel computing, Spark, job structure, cache replace, RDD weight