
Structured Data Processing And Performance Optimization Of Spark SQL

Posted on: 2020-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z Luo
Full Text: PDF
GTID: 2428330590471714
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the Spark in-memory computing framework has risen rapidly and greatly improved data processing speed. That speed, however, is bounded by the size of Spark's memory: performance is best when the data volume is smaller than or close to memory capacity, and degrades sharply otherwise. Spark SQL therefore exposes many problems when processing communication big data, represented here by 4G industry card data: slow read-write and query speed, uneven or insufficient allocation of system resources, and inefficient Joins between large tables.

This thesis processes structured data and optimizes performance from three aspects: Spark SQL data organization, the Spark resource management mechanism, and the Join algorithm. First, an improved data organization framework is proposed to raise Spark SQL's read-write and query speed. Second, a resource monitoring model is established to allocate and use resources reasonably. Finally, a large-table Join algorithm is proposed on top of the improved data organization framework and the monitoring model. The main work is as follows:

(1) By analyzing and comparing the data organization of Spark SQL and HBase, this thesis proposes an improved data organization framework. The framework first improves the read-write interface of the Parquet file format, and then builds a secondary index with HBase + Phoenix, which greatly accelerates reading, writing, and querying the 4G card data.

(2) This thesis further studies Spark's memory model and resource usage, obtains the cluster's underlying parameters through performance monitoring, and establishes a memory monitoring model that classifies resource usage and issues warnings. The warnings are fed back to subscribers through the observer pattern, and the subscription module adjusts the data traffic dynamically according to this feedback.

(3) Based on the improved data organization framework and the monitoring model, this thesis optimizes the large-table association algorithm and proposes a Join algorithm based on memory monitoring and batch processing. The algorithm dynamically controls the data flow and the Join batches through the monitoring model, and accelerates reads, writes, and queries through the improved data organization. Experiments show that the algorithm alleviates memory shortage to some extent and reduces the load imbalance caused by data skew; its overall running time is better than that of the default Join algorithm.

In summary, the performance of Spark SQL on structured data is governed by data organization and the memory model, with inefficient Joins as the main symptom. This thesis first improves the data organization framework, then establishes the memory monitoring model, and finally optimizes the Join operation, reducing the average processing time by 31.49%.
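The secondary-index scheme in point (1) can be illustrated, in miniature, without an HBase cluster. The sketch below is an assumption-laden stand-in: a dict plays the role of the HBase primary store keyed by row key, and a second dict plays the role of a Phoenix covered index on one column (`cell_id` is an invented field name; real 4G card data schemas will differ). The access pattern is the point: look up the index first, then fetch rows by key, instead of scanning the whole table.

```python
from collections import defaultdict

class SecondaryIndex:
    """Toy model of a row-keyed store with one secondary index,
    mimicking the HBase + Phoenix access pattern from the abstract."""

    def __init__(self):
        self.rows = {}                    # row_key -> record (primary store)
        self.by_cell = defaultdict(set)   # cell_id -> row keys (secondary index)

    def put(self, row_key, record):
        self.rows[row_key] = record
        self.by_cell[record["cell_id"]].add(row_key)

    def query_by_cell(self, cell_id):
        # Index lookup first, then point-gets on the primary store --
        # the same two-step read a Phoenix secondary index enables.
        return [self.rows[k] for k in sorted(self.by_cell[cell_id])]
```

In the real system the index maintenance and the two-step read are handled by Phoenix transparently at SQL level; the sketch only shows why the lookup avoids a full scan.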
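The monitoring model in point (2) pairs naturally with the observer pattern the abstract names. A minimal sketch, assuming invented class names and threshold values (0.7 and 0.9 are illustrative, not the thesis's actual warning levels): a monitor classifies memory pressure and notifies subscribers, and one subscriber reacts by shrinking or growing its ingest batch size, which is the "dynamic data traffic adjustment" feedback loop.

```python
class MemoryMonitor:
    """Classifies memory usage and pushes warnings to subscribers
    (observer pattern, as described in the abstract)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def report(self, used, total):
        ratio = used / total
        # Illustrative three-level classification of resource usage.
        if ratio < 0.7:
            level = "normal"
        elif ratio < 0.9:
            level = "warning"
        else:
            level = "critical"
        for s in self.subscribers:
            s.on_memory_level(level, ratio)

class BatchSizeController:
    """Subscriber that throttles data traffic based on warnings."""

    def __init__(self, batch_size=10000):
        self.batch_size = batch_size

    def on_memory_level(self, level, ratio):
        if level == "critical":
            self.batch_size = max(1000, self.batch_size // 2)
        elif level == "normal":
            self.batch_size = min(20000, self.batch_size * 2)
        # "warning": hold the current batch size steady.
```

In the thesis's setting the monitor would read executor memory metrics from the cluster rather than taking `used`/`total` as arguments; the feedback structure is the same.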
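The batch-processing idea behind the Join algorithm in point (3) can be sketched as a hash join that streams one table through in bounded slices. This is a pure-Python simplification, not the thesis's algorithm: `batch_size` is a plain parameter here, where the real system would take it from the memory-monitoring model, and the build side is assumed to fit in memory.

```python
from collections import defaultdict

def batched_hash_join(left, right, key, batch_size):
    """Equi-join `right` against `left` on `key`, consuming `right`
    in fixed-size batches so only one slice is in flight at a time."""
    # Build a hash index over the left table once.
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)

    out = []
    # Probe with the right table one batch at a time; in the thesis
    # the batch size would be steered by the memory monitor.
    for i in range(0, len(right), batch_size):
        for row in right[i:i + batch_size]:
            for match in index.get(row[key], []):
                out.append({**match, **row})
    return out
```

Batching the probe side bounds the working set per step and lets a controller shrink batches under memory pressure, which is how the abstract's algorithm mitigates insufficient memory and skew-induced load imbalance.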
Keywords/Search Tags:Spark SQL, structured data, Parquet, memory monitoring, Join