A Big Data Analyzing Facility Based On Spark Supporting Standard SQL Grammar

Posted on:2018-09-02

Degree:Master

Type:Thesis

Country:China

Candidate:C Zhang

Full Text:PDF

GTID:2428330596490065

Subject:Software engineering

Abstract/Summary:


In recent years, big data technologies have drawn significant attention from both the academic and industrial communities. Capturing and analyzing such intensive data sets is a great challenge. Google's MapReduce has proved to be a good framework for solving part of these problems. However, a growing number of MapReduce applications have exposed its low data processing efficiency. To cope with that issue, some researchers proposed Spark as an alternative. Spark SQL, a newer module in Spark, integrates relational processing with Spark's functional programming libraries. By supporting relational queries over big data, it has become one of the most widely used modules in Apache Spark.

However, neither the SQLContext nor the HiveContext that Spark SQL provides is perfect, and their deficiencies show in two respects. First, standard SQL grammar is not supported, so clients need time to adapt to the query languages. Some survey results show that almost all big data application researchers and developers have rich experience with traditional databases such as MySQL or Oracle and turned to the big data ecosystem only later, so they are more familiar with standard SQL grammar. It would serve them better to use standard grammar directly and avoid the learning cost. Second, Spark SQL has several functional defects: some frequently used data types and functionalities are not supported. This forces users to spend extra effort looking for workarounds. Together, these disadvantages lower efficiency considerably.

In this thesis we present FlintStone, a new resolving and analyzing facility based on Spark, with which clients can perform multiple computing operations on structured data sets. Compared with the current Spark SQL, FlintStone has three main advantages. First, FlintStone recognizes SQL:1999 standard grammar, which lets users familiar with traditional relational databases write queries in standard SQL on the Spark platform. Second, it seamlessly integrates more commonly used data types and functionalities. Third, FlintStone optimizes the physical plan before submitting the Spark job to the cluster. FlintStone can be seen as a good enhancement to Spark that bridges the gap between the SQL standard DML (data manipulation language) and the DML supported by Spark.

At the end of the thesis, we evaluate FlintStone against the two contexts from Spark SQL on different test suites. The results show that FlintStone does better in both standard SQL grammar recognition and execution performance. In particular, FlintStone outperforms the original Spark SQL by around 10% on join-type jobs. We have finished the first version of FlintStone and contributed it to the Spark community as open source on GitHub. FlintStone is also supported and maintained by experts from Intel's big data departments.

Keywords/Search Tags:

Big Data, Apache Spark, Spark SQL, Standard SQL Grammar, Facility


Related items

1	OCTWAS - Online Check-pointer for Workflows on Apache Spark
2	Using apache spark for scalable gene sequence analysis
3	Research On Taxi Trajectory Organization Method Based On Apache Spark
4	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
5	Research On The Discretization Algorithm Of Big Data Based On Spark
6	Design And Implementation Of A Performance Modeling System On Apache Spark
7	Research On K-Prototypes Algorithm Based On Mixed Data And Implementation Of Spark Platform
8	The Design And Implement Of Data Source Connector Based On Spark SQL
9	A System For Distributed MD Data Analysis Based On Spark
10	Enhanced Singular Collaborative Filtering Based Recommender System On Apache Spark