Research And Implementation Of Classification Algorithms Based On Spark

Posted on: 2018-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2348330518995566
Subject: Computer Science and Technology
Abstract/Summary:
With the rise of big data technology, quickly and effectively extracting hidden rules and value from massive data has become an urgent need. Spark is a highly reliable, high-performance distributed computing framework that has emerged in recent years. Because its programming model is based on in-memory computing, it is particularly well suited to iterative computation. Spark offers strong performance advantages and can adapt to a wide variety of big data applications. Classification is one of the most active and popular research fields in data mining. This thesis studies the parallelization of classification algorithms on the Spark platform. The main research work is as follows:

First, the classical decision tree algorithm is studied, and three representative decision tree algorithms, C4.5, CART and CHAID, are selected for comparative study. The algorithms are designed and implemented in parallel according to the three stages of decision tree construction (preprocessing, training and pruning). The experimental results show that the three parallel decision tree algorithms perform well.

Second, the shortcomings of the traditional back-propagation (BP) algorithm are studied, and the algorithm is improved in several respects, including a self-adaptive learning rate, a momentum factor, mini-batch gradient descent, a cross-entropy cost function and an early stopping criterion. The improved algorithm is implemented on the Spark platform; it maintains accuracy while greatly reducing training time.

Third, the equivalence between decision trees and neural networks is studied. Constructing the neural network with a decision tree method addresses the difficulty of determining the network structure and initial parameters. The decision tree algorithm and the BP neural network algorithm are combined to complete the parallelization of a decision-tree-based neural network algorithm. The experimental results show that this method not only improves accuracy but also speeds up training, and is highly practical.
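
The BP improvements listed above (momentum, mini-batch gradient descent, a cross-entropy cost, an adaptive learning rate and early stopping) can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark implementation. The network size, the toy data, the hyperparameters and the particular learning-rate rule (grow the rate by 5% when the training loss falls, shrink it by 30% otherwise) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, x, n_hid):
    # theta is a flat list laid out as: W1 (n_hid*n_in), b1 (n_hid), W2 (n_hid), b2 (1)
    n_in = len(x)
    off = n_hid * n_in + n_hid
    h = [sigmoid(sum(theta[j * n_in + i] * x[i] for i in range(n_in))
                 + theta[n_hid * n_in + j]) for j in range(n_hid)]
    y = sigmoid(sum(theta[off + j] * h[j] for j in range(n_hid)) + theta[off + n_hid])
    return h, y

def cross_entropy(theta, data, n_hid):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, t in data:
        _, y = forward(theta, x, n_hid)
        total -= t * math.log(y + eps) + (1 - t) * math.log(1 - y + eps)
    return total / len(data)

def gradient(theta, batch, n_hid):
    # Back-propagation of the cross-entropy loss, averaged over one mini-batch.
    n_in = len(batch[0][0])
    off = n_hid * n_in + n_hid
    g = [0.0] * len(theta)
    for x, t in batch:
        h, y = forward(theta, x, n_hid)
        d_out = y - t  # sigmoid output + cross-entropy gives this simple error term
        for j in range(n_hid):
            g[off + j] += d_out * h[j]
            d_hid = d_out * theta[off + j] * h[j] * (1 - h[j])
            g[n_hid * n_in + j] += d_hid
            for i in range(n_in):
                g[j * n_in + i] += d_hid * x[i]
        g[off + n_hid] += d_out
    return [gi / len(batch) for gi in g]

def train(train_set, val_set, n_hid=3, lr=0.1, momentum=0.9, batch_size=2,
          max_epochs=200, patience=10, seed=0):
    rng = random.Random(seed)
    n_in = len(train_set[0][0])
    theta = [rng.uniform(-0.5, 0.5) for _ in range(n_hid * n_in + 2 * n_hid + 1)]
    velocity = [0.0] * len(theta)
    history = [cross_entropy(theta, val_set, n_hid)]  # validation loss per epoch
    best_loss, best_theta, stale = history[0], list(theta), 0
    prev_train = cross_entropy(theta, train_set, n_hid)
    for _ in range(max_epochs):
        order = list(train_set)
        rng.shuffle(order)
        for k in range(0, len(order), batch_size):  # mini-batch updates
            g = gradient(theta, order[k:k + batch_size], n_hid)
            velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
            theta = [p + v for p, v in zip(theta, velocity)]
        cur_train = cross_entropy(theta, train_set, n_hid)
        # Assumed adaptive rule: speed up while improving, slow down otherwise.
        lr = lr * 1.05 if cur_train < prev_train else lr * 0.7
        prev_train = cur_train
        val = cross_entropy(theta, val_set, n_hid)
        history.append(val)
        if val < best_loss - 1e-6:
            best_loss, best_theta, stale = val, list(theta), 0
        else:
            stale += 1
            if stale >= patience:  # early stopping
                break
    return best_theta, history

# Toy data (an assumption): the label equals the first feature.
train_set = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]
val_set = [([0.1, 0.5], 0), ([0.9, 0.5], 1)]
best_theta, history = train(train_set, val_set)
print("epochs run:", len(history) - 1, "best validation loss:", min(history))
```

In a Spark setting, the per-batch gradient computation is the natural step to distribute across partitions, with the momentum update applied on the driver; the sketch keeps everything in one process so that the update rule itself stays visible.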
Keywords/Search Tags: Spark, classification algorithms, parallelization, decision tree, BP algorithm