Research On Query Analysis And Optimization Based On Spark System

Posted on: 2017-01-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2308330485460436
Subject: Computer technology

Abstract/Summary:
We are now in the era of big data. The explosive growth of data has left traditional technology architectures unable to meet the demands of massive-scale data processing, and big data platforms have emerged in response to this environment. The birth of Hadoop drew attention to the MapReduce computing model and to the search for higher-performance big data platforms. Because the Spark computing framework adopts the RDD (Resilient Distributed Dataset) data model and an in-memory computing model, it offers better applicability than Hadoop. Apache Spark is a fast, general-purpose big data processing engine that executes SQL queries, streaming data processing, machine learning, and graph computation well, and its query performance far exceeds that of a MapReduce cluster. The Spark SQL module integrates relational processing with Spark's programming API: SQL statements and real-time analysis over multiple data sources, which execute poorly on a MapReduce cluster, execute well in Spark SQL. However, with the rapid development of mobile internet technology, the large number of users and the growing amount of data degrade the performance of join operations between correlated tables in Spark, which in turn degrades overall system performance. The reason is that many records that do not satisfy the join condition still participate in the shuffle stage, causing heavy disk I/O and unnecessary network communication overhead. Improving join performance is therefore the key to analyzing massive data on the Spark system.

First, this paper studies the Catalyst optimizer in the Spark SQL module. It analyzes how Catalyst processes a query, covering the parsing, binding, optimization, and physical planning of a query statement, to build an in-depth understanding of Catalyst's mechanism.

Second, building on the strategies of other researchers and on the broadcast join, repartition join, and hash join that Spark SQL itself provides, this paper proposes a Spark join optimization scheme based on a partial Bloom filter data structure for equi-join operations between tables. The scheme pre-filters most of the records that do not satisfy the join condition, so that the amount of data in the shuffle phase is markedly reduced and join performance between large tables is effectively improved.

Finally, experiments compare the proposed partial-Bloom-filter join optimization scheme with the hash join that Spark SQL provides, analyzing the two schemes' network communication overhead, disk I/O overhead, and join execution time. The experimental results show that, by pre-filtering table data before the shuffle stage, the proposed scheme effectively reduces the data involved in large-table joins, thereby lowering network communication overhead, disk I/O overhead, and join execution time, and improving the efficiency of large-table join operations.
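To make the Catalyst pipeline described above concrete, the following is a minimal sketch (not taken from the thesis) assuming a Spark 2.x SparkSession. The standard explain(true) call prints the parsed logical plan, the analyzed (bound) plan, the optimized logical plan, and the selected physical plan, matching the parsing, binding, optimization, and physical planning phases analyzed in this paper. The view name t and its columns are illustrative only.

import org.apache.spark.sql.SparkSession

object CatalystPhases {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("catalyst-phases").getOrCreate()
    import spark.implicits._

    // Illustrative table; any registered view would do.
    Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("t")

    // explain(true) prints the four Catalyst stages analyzed in the thesis:
    // parsed logical plan, analyzed (bound) plan, optimized logical plan,
    // and the physical plan chosen for execution.
    spark.sql("SELECT id, name FROM t WHERE id > 1").explain(true)

    spark.stop()
  }
}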
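The abstract contrasts the broadcast join, repartition (shuffle) join, and hash join available in Spark SQL. As a rough illustration of the difference, the sketch below uses the standard broadcast hint from org.apache.spark.sql.functions; the tables and column names are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategies {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("join-strategies").getOrCreate()
    import spark.implicits._

    // Hypothetical tables: a large fact table and a small dimension table.
    val large = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("key", "amount")
    val small = Seq((1, "a"), (2, "b")).toDF("key", "label")

    // Broadcast join: ship the small table to every executor so the large
    // table is never shuffled.
    large.join(broadcast(small), "key").explain()

    // Without the hint (and above spark.sql.autoBroadcastJoinThreshold),
    // Spark falls back to a shuffle-based join in which both sides are
    // repartitioned by the join key.
    large.join(small, "key").explain()

    spark.stop()
  }
}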
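The thesis's partial Bloom filter structure itself is not reproduced here; as a sketch of the pre-filtering idea under stated assumptions, the following example uses Spark's built-in org.apache.spark.util.sketch.BloomFilter (available since Spark 2.0) to discard large-side records before the shuffle of an equi-join. All data and names are toy examples, not the thesis's implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object BloomPreFilterJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("bloom-prefilter-join").getOrCreate()
    val sc = spark.sparkContext

    // Toy inputs: two (key, value) RDDs to be equi-joined on the key.
    val small = sc.parallelize(Seq(("k1", "a"), ("k2", "b")))
    val large = sc.parallelize(Seq(("k1", 1), ("k3", 2), ("k4", 3)))

    // Build a Bloom filter over the smaller side's join keys on the driver
    // (1% false-positive rate) and broadcast it to all executors.
    val bloom = BloomFilter.create(small.count(), 0.01)
    small.keys.collect().foreach(k => bloom.putString(k))
    val bloomBc = sc.broadcast(bloom)

    // Pre-filter: drop large-side records whose keys cannot match *before*
    // the shuffle. False positives are harmless, since the join itself still
    // checks key equality; only the shuffle volume is affected.
    val prefiltered = large.filter { case (k, _) => bloomBc.value.mightContainString(k) }

    prefiltered.join(small).collect().foreach(println)
    spark.stop()
  }
}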
Keywords: Spark, Spark SQL, Catalyst Optimizer, Equi-Join, Partial Bloom Filter, Shuffle, Pre-Filter, Hash Join