Design and Implementation of a Network Data Parallel Processing System Based on the Hadoop Platform

Posted on: 2018-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Ji
Full Text: PDF
GTID: 2348330515485639
Subject: Electronic and communication engineering
Abstract/Summary:
The mobile Internet era has brought many conveniences to people's lives, but it also produces massive amounts of data, and extracting value from that data is an important research topic. Clustering is one of the tools for doing so and has a wide range of applications, including the classification of unknown items. As data volumes grow, running clustering algorithms in a stand-alone environment becomes increasingly difficult and runs into performance bottlenecks. The clustering algorithm and the corresponding processing system therefore need an improved architecture to resolve or mitigate the data-size problem. This thesis presents the design and implementation of a network data parallel processing system based on the Hadoop platform.

First, the thesis studies the performance of Spark in two parts: performance optimization in the development process and shuffle performance optimization. The former focuses on avoiding shuffle operators and persisting RDDs that are used multiple times. The latter studies sort shuffle and hash shuffle and their respective application scenarios.

To develop a parallel clustering algorithm for massive data processing, the thesis introduces the Hadoop platform and builds the Spark platform on top of it. It proposes a k-means algorithm on the Spark platform that uses Kruskal's algorithm to solve the initial center selection problem and to reduce the number of k-means iterations. Spark's k-means++ algorithm serves as the comparison baseline. The experimental results show that the Kruskal-based k-means algorithm has a shorter running time and needs fewer iterations than Spark's k-means++ algorithm. To address the problem that k-means does not consider the similarity between vectors, the thesis further proposes a k-means algorithm on the Spark platform that uses Kruskal's algorithm together with the valley distance. With the squared error function as the evaluation index, this algorithm yields lower squared error values and better clustering results than both Spark's k-means++ algorithm and the Kruskal-based k-means algorithm.

Finally, a complete network data processing system is built, capable of computing on large and highly complex data. The Hadoop computing platform lets the system rely on cheap hardware while providing high computing power and storage capacity. The system also has good horizontal scalability: as the data scale grows, cluster processing capacity can be increased simply by adding machines. In addition, the system is broadly applicable, not only to film recommendation and anomaly detection but also to any data processing scenario that uses a clustering algorithm.
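
The abstract does not spell out how Kruskal's algorithm is used to pick the initial centers. One plausible reading, offered here only as a single-machine sketch and not as the thesis's method, is to run Kruskal's algorithm on the complete distance graph, stop as soon as k connected components remain (equivalent to cutting the k-1 longest edges of the minimum spanning tree), and take each component's mean as an initial center. The function name `kruskal_initial_centers` and the use of Euclidean distance are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def kruskal_initial_centers(points, k):
    """Return k initial centers by stopping Kruskal's algorithm at k components.

    Hypothetical helper: an illustrative reading of "Kruskal-based" center
    selection, not the procedure confirmed by the thesis.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Union-find forest: every point starts as its own component.
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (O(n^2) edges, small inputs only).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    components = n
    for _, i, j in edges:
        if components == k:
            break  # the remaining, longer edges are the ones left "cut"
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1

    # Group points by component root and average each group.
    groups = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(points[idx])

    return [
        tuple(sum(coords) / len(members) for coords in zip(*members))
        for members in groups.values()
    ]
```

The returned centers could then seed an ordinary k-means run in place of random or k-means++ initialization. A distributed version on Spark would need a different strategy for building the edge list, since the full pairwise graph does not scale to massive data.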
Keywords/Search Tags:Clustering, Hadoop, Spark, Performance Tuning