Implementation And Evaluation Of Big Data Parallel Join Algorithms

Posted on: 2022-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Xia
GTID: 2518306572960149
Subject: Software engineering
Abstract/Summary:
As big data has grown in popularity, distributed computing frameworks such as Hadoop and Spark are increasingly used in real big data applications. To address the core data management problems on big data platforms, the core query operations of traditional data management systems need to be extended to these platforms. Spark's existing built-in query operations are designed for the built-in data types of Spark SQL, and the set of implemented algorithms is not comprehensive. In addition, there has been little comprehensive evaluation of the core data management query operations on the Spark platform, so existing results rarely meet the evaluation needs of a specific environment. In response to these problems, this thesis studies the implementation and evaluation of parallel join algorithms on the Spark platform. The goal is to design and implement a more widely applicable join algorithm library on Spark and to report evaluation results for the experimental platform used.

First, the thesis addresses the equi-join, the most common join in databases. It introduces three optimized equi-join algorithms, Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join, and implements each with Spark RDDs. Experiments are run on datasets of varying sizes, and the performance indicators observed are used to identify the scenarios to which each algorithm is suited.

Second, the thesis evaluates common solutions to data skew, a problem that frequently arises in big data. It describes the data skew phenomenon on big data platforms, introduces common solutions, and implements them with Spark RDDs as a set of algorithms executable on a Spark cluster. Experiments on datasets of varying sizes compare the effects of the different solutions.

Finally, the thesis studies the multi-way θ-join, a complex kind of join. Multi-way θ-join algorithms from traditional and distributed frameworks are adapted into a multi-way θ-join algorithm for Spark clusters, again implemented with Spark RDDs. Experiments were carried out on datasets of different sizes, and analysis of the results shows that the algorithm performs well, which demonstrates its effectiveness.
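
For readers unfamiliar with the three equi-join strategies named in the abstract, the following is a minimal single-process sketch in plain Python. It is not the thesis's Spark RDD implementation: lists of `(key, value)` pairs stand in for RDDs, and every function name and parameter here is hypothetical. The abstract does not say which data skew remedies were evaluated; the `salted_join` function shows key salting only as one widely used option, assumed here for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(small, large):
    """Build a hash table on the small side, then probe it with each large-side row."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(key, (s, l)) for key, l in large for s in table.get(key, [])]


def shuffle_hash_join(left, right, num_partitions=4):
    """Hash-partition both sides by key, then hash-join each partition pair locally."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))
    out = []
    for p in range(num_partitions):
        # Rows with equal keys always land in the same partition index.
        out.extend(broadcast_hash_join(left_parts[p], right_parts[p]))
    return out


def sort_merge_join(left, right):
    """Sort both sides by key, then merge, pairing up runs of equal keys."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, (lv, rv)))
            i, j = i_end, j_end
    return out


def salted_join(skewed, other, salts=3):
    """Key salting (assumed skew remedy): spread hot keys over several sub-keys."""
    # Skewed side: attach a salt to each key (round-robin, for determinism).
    salted_left = [((key, n % salts), value) for n, (key, value) in enumerate(skewed)]
    # Other side: replicate every row once per salt so each sub-key finds a match.
    salted_right = [((key, s), value) for key, value in other for s in range(salts)]
    joined = shuffle_hash_join(salted_left, salted_right)
    # Strip the salt to recover the original key.
    return [(key, pair) for (key, _salt), pair in joined]


# Hypothetical sample data: a small "users" table and a larger "orders" table.
users = [(1, "x"), (3, "y")]
orders = [(1, "a"), (2, "b"), (1, "c")]
bhj_result = sorted(broadcast_hash_join(users, orders))
shj_result = sorted(shuffle_hash_join(users, orders))
smj_result = sorted(sort_merge_join(users, orders))
salted_result = sorted(salted_join(orders, users))
```

All three equi-join functions return the same set of rows and differ only in how matching keys are brought together, which is what drives their differing costs on a cluster: the broadcast variant avoids moving the large side, while the shuffle and sort-merge variants redistribute both sides.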
Keywords/Search Tags:Distributed computing framework, Spark, join algorithm