
Research On Spark Data Skewing Improvement And Decision Tree Parallelization Application Under Cloud Edge Collaboration

Posted on: 2022-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Jia
GTID: 2518306743978059
Subject: Computer application technology
Abstract/Summary:
With the rapid development and spread of IoT and 5G technology, a large number of data-intensive, computation-intensive, and highly latency-sensitive applications have emerged, such as autonomous driving and intelligent door guards for crowded places. Processing the big data generated by these applications quickly and efficiently has therefore become an urgent problem. Spark is a fast, in-memory big data processing platform, but it still has shortcomings. First, the uneven distribution produced by Spark's native partitioning algorithms, together with imbalance in the input data itself, can cause data skew during execution and reduce overall task efficiency. Second, machine learning algorithms commonly used on Spark, such as the traditional decision tree algorithm, have weaknesses in continuous attribute discretization and attribute selection that make them inefficient. This thesis therefore studies and optimizes the data skew problem and the decision tree algorithm on the Spark platform in a cloud-edge collaborative environment. The main research contents are as follows.

(1) Research on Spark data skew optimization based on partitioning. In the Shuffle stage of data-intensive Spark applications, unbalanced input data and the platform's native partitioning algorithm cause data skew. To address this, the thesis proposes a partition-optimization method called the cluster sampling greedy algorithm (CSGA). A parallel cluster sampling algorithm samples the intermediate data produced by each Map task to predict the data distribution and the frequency of each key, and the keys are weighted accordingly. A model is then constructed to measure the data skew of each partition, and a greedy strategy divides the intermediate data so that the amount of data in each partition is more balanced. Experimental validation shows that, compared with Spark's Hash and Range partitioning methods and with other improved methods for data skew, CSGA effectively addresses skewed data partitioning, reduces task execution time, and is reasonably general.

(2) Research on optimization of the decision tree algorithm based on Spark. C4.5 is a common decision tree algorithm, but when handling continuous data it must compute the information gain of the midpoint split between every pair of adjacent attribute values, which is time-consuming. Its attribute selection also ignores interactions between attributes, which reduces accuracy. To address these issues, the thesis first optimizes the C4.5 algorithm and then parallelizes it in a Spark environment under cloud-edge collaboration. First, the information entropy calculation in the boundary point definition and Fayyad's theorem is improved using the Gini index. This reduces the number of split points whose information entropy the traditional C4.5 algorithm must evaluate when discretizing continuous attributes, simplifies the calculation formula, and shortens execution time. On this basis, the CFS algorithm is introduced to optimize the calculation of the information gain ratio so that better attributes are selected for splitting the tree. The data is then repartitioned on the Spark platform using the partitioning method proposed in this thesis, and the improved C4.5 algorithm is parallelized with map and reduce operations. Finally, an "intelligent door guard" under cloud-edge collaboration in an epidemic prevention and control setting is chosen as the application scenario for experimental verification: people seeking to enter a venue are classified as high or low risk to decide whether entry is allowed. Experiments show that the improved parallel C4.5 algorithm reduces running time and improves accuracy compared with the traditional C4.5 algorithm and other improved versions of C4.5.
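
The abstract does not give the details of CSGA, so the following is only a minimal sketch of the general idea it describes: estimate key frequencies from a sample, treat them as weights, and greedily place keys so that partition loads stay balanced. The function names (`estimate_key_weights`, `greedy_assign`, `partition_loads`) are hypothetical, and the greedy rule used here (heaviest key first, into the currently lightest partition) is a standard load-balancing heuristic assumed for illustration, not necessarily the one used in the thesis.

```python
import heapq
from collections import Counter


def estimate_key_weights(sample):
    """Count how often each key appears in a sample of (key, value) records.

    Assumption: the sample is representative of the full intermediate data.
    """
    return Counter(key for key, _ in sample)


def greedy_assign(key_weights, num_partitions):
    """Map each key to a partition id using a greedy balancing heuristic.

    Keys are visited from heaviest to lightest, and each key goes to the
    partition with the smallest accumulated weight so far.
    """
    # Min-heap of (accumulated_weight, partition_id).
    heap = [(0, pid) for pid in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Ties on weight are broken by the key's string form for determinism.
    ordered = sorted(key_weights.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, weight in ordered:
        load, pid = heapq.heappop(heap)
        assignment[key] = pid
        heapq.heappush(heap, (load + weight, pid))
    return assignment


def partition_loads(key_weights, assignment, num_partitions):
    """Return the total estimated weight placed on each partition."""
    loads = [0] * num_partitions
    for key, weight in key_weights.items():
        loads[assignment[key]] += weight
    return loads


if __name__ == "__main__":
    # Hypothetical skewed sample: key "a" dominates.
    sample = [("a", 1)] * 6 + [("b", 1)] * 3 + [("c", 1)] * 2 + [("d", 1)]
    weights = estimate_key_weights(sample)
    assignment = greedy_assign(weights, num_partitions=2)
    print(assignment)
    print(partition_loads(weights, assignment, 2))
```

On a Spark RDD of key-value pairs, a mapping like `assignment` could in principle be passed to PySpark's `RDD.partitionBy(numPartitions, partitionFunc)` as a lookup-based partition function. Keys that never appeared in the sample would need a fallback such as hashing, which this sketch leaves out.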
Keywords/Search Tags:Spark, Data skew, C4.5 algorithm, Edge Computing, Cloud Edge collaboration