
Structured Data Processing And Performance Optimization Of Spark SQL

Posted on: 2020-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z Luo
Full Text: PDF
GTID: 2428330590471714
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the Spark in-memory computing framework has risen rapidly and greatly improved data processing speed. That speed, however, is bounded by the size of Spark's memory: performance is best when the data volume is smaller than or close to memory capacity, and degrades sharply otherwise. Spark SQL therefore exposes many problems when processing communication big data, represented here by 4G industry card data: slow read-write and query speed, uneven or insufficient allocation of system resources, and inefficient Joins between large tables.

This thesis processes structured data and optimizes performance from three aspects: Spark SQL data organization, the Spark resource management mechanism, and the Join algorithm. First, an improved data organization framework is proposed to raise Spark SQL's read-write and query speed. Second, a resource monitoring model is established to allocate and use resources reasonably. Finally, a large-table Join algorithm is proposed on top of the improved data organization framework and the monitoring model. The main work is as follows:

(1) By analyzing and comparing the data organization of Spark SQL and HBase, this thesis proposes an improved data organization framework. The framework first improves the read-write interface of the Parquet file format, and then builds a secondary index with HBase + Phoenix, which greatly accelerates reading, writing, and querying the 4G card data.

(2) This thesis further studies Spark's memory model and resource usage, obtains the cluster's underlying parameters through performance monitoring, and establishes a memory monitoring model that classifies resource usage and issues warnings. The warnings are fed back to subscribers through the observer pattern, and the subscription module adjusts the data traffic dynamically according to this feedback.

(3) Based on the improved data organization framework and the monitoring model, this thesis optimizes the large-table association algorithm and proposes a Join algorithm based on memory monitoring and batch processing. The algorithm dynamically controls the data flow and the Join batches through the monitoring model, and accelerates reads, writes, and queries through the improved data organization. Experiments show that the algorithm alleviates memory shortage to some extent and reduces the load imbalance caused by data skew; its overall running time is better than that of the default Join algorithm.

In summary, the performance of Spark SQL on structured data is governed by data organization and the memory model, with inefficient Joins as the main symptom. This thesis first improves the data organization framework, then establishes the memory monitoring model, and finally optimizes the Join operation, reducing the average processing time by 31.49%.
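The secondary-index scheme in point (1) can be illustrated, in miniature, without an HBase cluster. The sketch below is an assumption-laden stand-in: a dict plays the role of the HBase primary store keyed by row key, and a second dict plays the role of a Phoenix covered index on one column (`cell_id` is an invented field name; real 4G card data schemas will differ). The access pattern is the point: look up the index first, then fetch rows by key, instead of scanning the whole table.

```python
from collections import defaultdict

class SecondaryIndex:
    """Toy model of a row-keyed store with one secondary index,
    mimicking the HBase + Phoenix access pattern from the abstract."""

    def __init__(self):
        self.rows = {}                    # row_key -> record (primary store)
        self.by_cell = defaultdict(set)   # cell_id -> row keys (secondary index)

    def put(self, row_key, record):
        self.rows[row_key] = record
        self.by_cell[record["cell_id"]].add(row_key)

    def query_by_cell(self, cell_id):
        # Index lookup first, then point-gets on the primary store --
        # the same two-step read a Phoenix secondary index enables.
        return [self.rows[k] for k in sorted(self.by_cell[cell_id])]
```

In the real system the index maintenance and the two-step read are handled by Phoenix transparently at SQL level; the sketch only shows why the lookup avoids a full scan.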
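The monitoring model in point (2) pairs naturally with the observer pattern the abstract names. A minimal sketch, assuming invented class names and threshold values (0.7 and 0.9 are illustrative, not the thesis's actual warning levels): a monitor classifies memory pressure and notifies subscribers, and one subscriber reacts by shrinking or growing its ingest batch size, which is the "dynamic data traffic adjustment" feedback loop.

```python
class MemoryMonitor:
    """Classifies memory usage and pushes warnings to subscribers
    (observer pattern, as described in the abstract)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def report(self, used, total):
        ratio = used / total
        # Illustrative three-level classification of resource usage.
        if ratio < 0.7:
            level = "normal"
        elif ratio < 0.9:
            level = "warning"
        else:
            level = "critical"
        for s in self.subscribers:
            s.on_memory_level(level, ratio)

class BatchSizeController:
    """Subscriber that throttles data traffic based on warnings."""

    def __init__(self, batch_size=10000):
        self.batch_size = batch_size

    def on_memory_level(self, level, ratio):
        if level == "critical":
            self.batch_size = max(1000, self.batch_size // 2)
        elif level == "normal":
            self.batch_size = min(20000, self.batch_size * 2)
        # "warning": hold the current batch size steady.
```

In the thesis's setting the monitor would read executor memory metrics from the cluster rather than taking `used`/`total` as arguments; the feedback structure is the same.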
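The batch-processing idea behind the Join algorithm in point (3) can be sketched as a hash join that streams one table through in bounded slices. This is a pure-Python simplification, not the thesis's algorithm: `batch_size` is a plain parameter here, where the real system would take it from the memory-monitoring model, and the build side is assumed to fit in memory.

```python
from collections import defaultdict

def batched_hash_join(left, right, key, batch_size):
    """Equi-join `right` against `left` on `key`, consuming `right`
    in fixed-size batches so only one slice is in flight at a time."""
    # Build a hash index over the left table once.
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)

    out = []
    # Probe with the right table one batch at a time; in the thesis
    # the batch size would be steered by the memory monitor.
    for i in range(0, len(right), batch_size):
        for row in right[i:i + batch_size]:
            for match in index.get(row[key], []):
                out.append({**match, **row})
    return out
```

Batching the probe side bounds the working set per step and lets a controller shrink batches under memory pressure, which is how the abstract's algorithm mitigates insufficient memory and skew-induced load imbalance.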
Keywords/Search Tags:Spark SQL, structured data, Parquet, memory monitoring, Join