
Study And Implementation Of Spark SQL Query Optimization

Posted on: 2018-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: K Z Ding
Full Text: PDF
GTID: 2348330518496295
Subject: Computer Science and Technology
Abstract/Summary:
The spread of information technology and the rapid rise of the mobile internet have ushered in an unprecedented era of big data. The explosive growth of data volumes poses major challenges to how data is used and studied, and extracting value from massive data sets has become especially critical. In recent years, MapReduce has become the core framework of many big data systems: by fully exploiting parallel computation it enables efficient processing of massive data, and such systems have been widely adopted. As an important tool of the big data age, SQL-on-Hadoop systems combine the simplicity and ease of use of SQL with the ability of the Hadoop ecosystem to handle massive data and mine its potential value. However, Spark SQL, the most representative SQL-on-Hadoop system, still cannot return results quickly for queries over terabyte-scale data, and this latency severely degrades the user experience. Improving the query efficiency of Spark SQL is therefore a current research focus.

To address these shortcomings, this paper proposes a general optimization method for SQL-on-Hadoop systems and provides the corresponding implementation. First, by analyzing the query scenarios, we rewrite the original data into the columnar storage format Parquet, which is better suited to analytic queries. Second, during metadata generation we add a Bloom filter and a histogram to the metadata. Third, in the predicate pushdown phase we use the Bloom filter and the histogram to filter out irrelevant data more effectively. Finally, for LIMIT queries we sort the data so that the Spark job can stop early during execution, reducing query time. The proposed optimization starts from the underlying storage format and concentrates on strengthening data skipping, thereby improving Spark SQL query efficiency systematically and effectively.

The paper first analyzes the shortcomings of the current Spark SQL system and outlines the main work of the query optimization. It then presents the detailed design of the query optimization on the Spark SQL system, covering the division of functional modules and the implementation of the whole system. Finally, it reports experimental results for the proposed method and verifies its completeness and validity.
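The core idea summarized above, using per-block metadata (a Bloom filter plus a histogram) to skip irrelevant data during predicate pushdown, can be illustrated with a small sketch. The Scala code below is only an illustrative model: the names BlockStats, fromValues, mightContain, and DataSkippingDemo are hypothetical, it does not use the Spark or Parquet APIs, and it is not the thesis's actual implementation. It simply shows how min/max statistics and a Bloom filter together decide whether a data block can be skipped for an equality predicate.

import scala.util.hashing.MurmurHash3

// Per-block metadata: min/max bounds (a one-bucket histogram) plus a Bloom filter
// over the block's values. Both are hypothetical stand-ins for the file-level
// statistics described in the abstract.
case class BlockStats(min: Long, max: Long, bloomBits: Array[Boolean]) {
  private def hashes(v: Long): Seq[Int] =
    (1 to 3).map { seed =>
      val h = MurmurHash3.stringHash(v.toString, seed)
      ((h % bloomBits.length) + bloomBits.length) % bloomBits.length
    }

  def add(v: Long): Unit = hashes(v).foreach(i => bloomBits(i) = true)

  // For an equality predicate "col = v": the block can be skipped if either
  // the min/max range or the Bloom filter rules the value out.
  def mightContain(v: Long): Boolean =
    v >= min && v <= max && hashes(v).forall(i => bloomBits(i))
}

object BlockStats {
  def fromValues(values: Seq[Long], bits: Int = 1024): BlockStats = {
    val stats = BlockStats(values.min, values.max, Array.fill(bits)(false))
    values.foreach(stats.add)
    stats
  }
}

object DataSkippingDemo {
  def main(args: Array[String]): Unit = {
    // One data block holding the values 100, 110, ..., 200.
    val stats = BlockStats.fromValues((100L to 200L by 10L).toSeq)
    println(stats.mightContain(150L)) // true: the block must be scanned
    println(stats.mightContain(999L)) // false: pruned by the min/max bound
    println(stats.mightContain(157L)) // almost certainly false: pruned by the Bloom filter
  }
}

In a columnar store such as Parquet, an analogous decision is made per block of rows during predicate pushdown, so blocks whose statistics exclude the predicate value need not be read at all, which is the data-skipping effect the abstract targets.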
Keywords/Search Tags: Spark SQL, Bloom Filter, Histogram, Data Skipping