
Study And Implementation Of Spark SQL Query Optimization

Posted on: 2018-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: K Z Ding
Full Text: PDF
GTID: 2348330518496295
Subject: Computer Science and Technology
Abstract/Summary:
The spread of information technology and the rapid rise of the mobile internet have ushered in an unprecedented era of big data. The explosive growth of data volumes poses major challenges to how data is used and studied, and extracting value from massive data sets has become especially critical. In recent years, MapReduce has become the core framework of many big data systems: by fully exploiting parallel computation it enables efficient processing of massive data, and such systems have been widely adopted. As an important tool of the big data age, SQL-on-Hadoop systems combine the simplicity and ease of use of SQL with the ability of the Hadoop ecosystem to handle massive data and mine its potential value. However, Spark SQL, the most representative SQL-on-Hadoop system, still cannot return results quickly for queries over terabyte-scale data, and this latency severely degrades the user experience. Improving the query efficiency of Spark SQL is therefore a current research focus.

To address these shortcomings, this paper proposes a general optimization method for SQL-on-Hadoop systems and provides the corresponding implementation. First, by analyzing the query scenarios, we rewrite the original data into the columnar storage format Parquet, which is better suited to analytic queries. Second, during metadata generation we add a Bloom filter and a histogram to the metadata. Third, in the predicate pushdown phase we use the Bloom filter and the histogram to filter out irrelevant data more effectively. Finally, for LIMIT queries we sort the data so that the Spark job can stop early during execution, reducing query time. The proposed optimization starts from the underlying storage format and concentrates on strengthening data skipping, thereby improving Spark SQL query efficiency systematically and effectively.

The paper first analyzes the shortcomings of the current Spark SQL system and outlines the main work of the query optimization. It then presents the detailed design of the query optimization on the Spark SQL system, covering the division of functional modules and the implementation of the whole system. Finally, it reports experimental results for the proposed method and verifies its completeness and validity.
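The core idea summarized above, using per-block metadata (a Bloom filter plus a histogram) to skip irrelevant data during predicate pushdown, can be illustrated with a small sketch. The Scala code below is only an illustrative model: the names BlockStats, fromValues, mightContain, and DataSkippingDemo are hypothetical, it does not use the Spark or Parquet APIs, and it is not the thesis's actual implementation. It simply shows how min/max statistics and a Bloom filter together decide whether a data block can be skipped for an equality predicate.

import scala.util.hashing.MurmurHash3

// Per-block metadata: min/max bounds (a one-bucket histogram) plus a Bloom filter
// over the block's values. Both are hypothetical stand-ins for the file-level
// statistics described in the abstract.
case class BlockStats(min: Long, max: Long, bloomBits: Array[Boolean]) {
  private def hashes(v: Long): Seq[Int] =
    (1 to 3).map { seed =>
      val h = MurmurHash3.stringHash(v.toString, seed)
      ((h % bloomBits.length) + bloomBits.length) % bloomBits.length
    }

  def add(v: Long): Unit = hashes(v).foreach(i => bloomBits(i) = true)

  // For an equality predicate "col = v": the block can be skipped if either
  // the min/max range or the Bloom filter rules the value out.
  def mightContain(v: Long): Boolean =
    v >= min && v <= max && hashes(v).forall(i => bloomBits(i))
}

object BlockStats {
  def fromValues(values: Seq[Long], bits: Int = 1024): BlockStats = {
    val stats = BlockStats(values.min, values.max, Array.fill(bits)(false))
    values.foreach(stats.add)
    stats
  }
}

object DataSkippingDemo {
  def main(args: Array[String]): Unit = {
    // One data block holding the values 100, 110, ..., 200.
    val stats = BlockStats.fromValues((100L to 200L by 10L).toSeq)
    println(stats.mightContain(150L)) // true: the block must be scanned
    println(stats.mightContain(999L)) // false: pruned by the min/max bound
    println(stats.mightContain(157L)) // almost certainly false: pruned by the Bloom filter
  }
}

In a columnar store such as Parquet, an analogous decision is made per block of rows during predicate pushdown, so blocks whose statistics exclude the predicate value need not be read at all, which is the data-skipping effect the abstract targets.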
Keywords/Search Tags: Spark SQL, Bloom Filter, Histogram, Data Skipping