Query Optimization In Spark SQL For Business Data Of 4G Industry Card Based On HDFS

Posted on: 2020-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: X D Chen
Full Text: PDF
GTID: 2428330590471696
Subject: Computer Science and Technology
Abstract/Summary:
With the wide use of Spark SQL and HDFS for structured queries on big data, query speed has improved markedly, but some problems have also been exposed. For example, an ill-suited default HDFS Block size hurts the query efficiency of Spark SQL, and Spark SQL is inefficient when processing massive numbers of small files. These problems are especially noticeable for communication data such as 4G industry application card (IAC) data. This thesis studies the distributed storage of 4G IAC data on HDFS and improves the query efficiency of Spark SQL by optimizing how the data is stored on HDFS, including dynamically setting the HDFS Block size and processing massive numbers of small 4G IAC files. Finally, a 4G IAC ETL system is designed and implemented based on the research results. The main work of this thesis is as follows:

Firstly, for the large files in the 4G IAC data, the impact of the HDFS Block size on Spark SQL query efficiency is analyzed, and a strategy of dynamically setting the HDFS Block size according to the data size is proposed. Experiments show that under this strategy, data of the same type is stored in a more balanced and reasonable way, and Spark SQL achieves higher query efficiency.

Secondly, the reason why Spark SQL reads massive numbers of small files inefficiently is analyzed theoretically. According to the features of 4G IAC data, the local merging storage model is improved to merge and transform a large number of small files, and the merged and transformed files are then stored on HDFS partitioned by time. For merging small files, two schemes, one based on Java multi-threading and one based on Spark, are compared and tested. The experimental results show that the query efficiency of Spark SQL improves significantly after the massive small files are merged and transformed.

Finally, based on the above research results, this thesis designs and develops the 4G IAC data ETL system. Each functional module is analyzed and implemented according to the requirements of the 4G IAC data analysis business. The system has also passed the customer's running test.
Keywords/Search Tags: 4G IAC, Spark SQL, HDFS, Block Storage, Small File
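
The abstract names two storage-side ideas, sizing the HDFS Block from the data size and merging small files into time partitions, without giving the rules themselves. The Python sketch below is only an illustration of how such rules could look; it is not the thesis's method. Every name and number in it (`choose_block_size`, `plan_merges`, the target of 8 blocks per file, the 64 MiB and 1024 MiB bounds, the 128 MiB merge target, daily partitions) is an assumption made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone

MIB = 1024 * 1024


def choose_block_size(file_size, target_blocks=8,
                      min_block=64 * MIB, max_block=1024 * MIB):
    """Pick a power-of-two block size (hypothetical rule, not from the thesis).

    The block size is doubled from min_block until the file fits in about
    target_blocks blocks or max_block is reached.
    """
    block = min_block
    while block < max_block and file_size > block * target_blocks:
        block *= 2
    return block


def plan_merges(files, target_size=128 * MIB):
    """Plan small-file merges per day (hypothetical rule, not from the thesis).

    files is an iterable of (name, size_in_bytes, epoch_seconds). Files are
    grouped by UTC day, then packed in input order into batches whose total
    size stays within target_size. A file larger than target_size ends up
    in a batch of its own.
    """
    by_day = defaultdict(list)
    for name, size, ts in files:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        by_day[day].append((name, size))

    plan = {}
    for day in sorted(by_day):
        batches, current, current_size = [], [], 0
        for name, size in by_day[day]:
            # Close the current batch before it would exceed the target.
            if current and current_size + size > target_size:
                batches.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            batches.append(current)
        plan[day] = batches
    return plan


if __name__ == "__main__":
    print(choose_block_size(4096 * MIB) // MIB)  # block size in MiB for a 4 GiB file
    sample = [("a", 60 * MIB, 0), ("b", 60 * MIB, 10),
              ("c", 60 * MIB, 20), ("d", 10 * MIB, 86400)]
    print(plan_merges(sample))
```

In HDFS the block size is a per-file attribute fixed when the file is written, so a value chosen this way would have to be passed at write time (for instance through the `dfs.blocksize` setting). The merge plan only decides which small files go together; the actual concatenation and format conversion could be done by either of the two schemes the abstract compares.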