Implementation And Evaluation Of Big Data Parallel Join Algorithms

Posted on:2022-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:W Y Xia

Full Text:PDF

GTID:2518306572960149

Subject:Software engineering

Abstract/Summary:

With the growing popularity of big data, more and more distributed computing frameworks (such as Hadoop and Spark) are being applied in real big data applications. To address the core data management problems on big data platforms, the core query operations of traditional data management systems need to be extended to these platforms. Spark's existing built-in query operations are designed for the built-in data types of Spark SQL, and the set of implemented algorithms is not comprehensive. In addition, there has been little comprehensive evaluation of the core data management query operations on the Spark platform, and the existing work can hardly meet the evaluation needs of a specific environment. In response to these problems, this thesis studies the implementation and evaluation of parallel join algorithms on the Spark platform. The goal is to design and implement a more widely applicable join algorithm library on Spark and to report evaluation results for the experimental platform used.

First, the thesis addresses the equi-join, the most common join in databases. It introduces several optimized equi-join algorithms, namely Broadcast Hash Join, Shuffle Hash Join, and Sort Merge Join, and implements each of them with Spark RDDs. Experiments on a series of data sets of different sizes compare the performance indicators of the algorithms and identify the scenarios to which each one applies.

Second, the thesis evaluates common solutions to the data skew problem that frequently arises in big data. It examines the data skew phenomenon on big data platforms, introduces the common solutions, and implements them with Spark RDDs as a set of algorithms executable on a Spark cluster. Experiments on data sets of different sizes compare the effects of the different solutions.

Finally, the thesis studies the multi-way θ-join, a complex kind of join. Multi-way θ-join algorithms from traditional and distributed frameworks are adapted into a multi-way θ-join algorithm for Spark clusters, again implemented with Spark RDDs. Experiments were carried out on data sets of different sizes, and analysis of the experimental results shows that the algorithm performs well, which demonstrates its effectiveness.
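To make the three equi-join strategies named in the abstract concrete, the following is a minimal sketch in plain Python over in-memory lists of (key, value) pairs. It is not the thesis's Spark RDD implementation: the function names (hash_join, shuffle_hash_join, sort_merge_join), the partition count, and the sample data are all assumptions made for illustration, and the "partitions" here are ordinary lists rather than cluster nodes.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Hash join: build a table on one side, probe it with the other.

    With a small `build` side copied to every worker, this is the idea
    behind a broadcast hash join. Returns (key, (build_val, probe_val)).
    """
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, (b, p)) for key, p in probe for b in table.get(key, [])]


def shuffle_hash_join(left, right, num_partitions=4):
    """Partition both sides by key hash, then hash-join each partition pair.

    Rows with equal keys always land in the same partition index, so the
    per-partition joins together cover every matching pair.
    """
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for row in left:
        left_parts[hash(row[0]) % num_partitions].append(row)
    for row in right:
        right_parts[hash(row[0]) % num_partitions].append(row)
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(hash_join(lp, rp))
    return out


def sort_merge_join(left, right):
    """Sort both sides by key, then merge them with two cursors."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, (lv, rv)))
            i, j = i_end, j_end
    return out


# Hypothetical sample data: a small table and a larger one.
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "cup"), (4, "hat")]

results = [
    sorted(hash_join(users, orders)),
    sorted(shuffle_hash_join(users, orders)),
    sorted(sort_merge_join(users, orders)),
]
```

All three functions compute the same inner equi-join and differ only in how rows with equal keys are brought together, which is the source of the performance differences the thesis measures on a real cluster.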

Keywords/Search Tags:

Distributed computing framework, Spark, join algorithm

Related items

1	Optimization Scheme And Implementation Of Join Operation In Spark Computing Engine
2	Research And Implementation Of Similarity Join For Big Data
3	Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework
4	Research On Query Analysis And Optimization Based On Spark System
5	Implementation And Optimization For Join Operation In Spark
6	Research On Optimizing Top-k Join Queries Based On Spark
7	Research On Cardinalities Estimation Of Two Table For Join Operator Based On Spark SQL Platform
8	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
9	The Conversation Corpus Management System Based On Spark
10	Optimizing Big Data Equi-join In Spark And Its Application In Analysis Of Network Traffic Data