
Research And Application Of Big Data Clustering Algorithm Based On Spark Platform

Posted on: 2019-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 2428330566499347
Subject: Software engineering
Abstract/Summary:
The rapid development of the information society produces huge amounts of data, and this data is inevitably disordered and poorly categorized, which has sharply increased the demand for clustering massive data sets. Two problems arise against this background. First, traditional clustering algorithms can no longer cope with the complexity of today's data, so existing algorithms urgently need to be improved or new ones proposed. Second, the hardware limits of a single machine can no longer meet the demands of processing massive data, so distributed, cluster-based platforms have gradually replaced the traditional single server for large-scale data processing. In particular, the emergence of Spark, an in-memory distributed computing framework, makes it possible to solve most massive-data processing problems.

This thesis optimizes the clustering algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and implements it to run in parallel on the Spark distributed computing framework. The main work on improving BIRCH is as follows:

(1) The massive data is first collected and preprocessed. Specifically, the data is compressed to reduce its volume and ease the pressure of processing it. Before compression, the original data is backed up to guard against loss.

(2) Because BIRCH is sensitive to the order of its input data, the K-means algorithm is first used to perform a rough clustering. The number of clusters is kept small and the threshold parameters relatively large, so that K clusters are produced quickly. The K clusters are then fully permuted, and, depending on the number of permutations, all or some of these orderings are used as input to the next clustering step.

(3) BIRCH is improved so that it can run in parallel on the Spark platform. The K-means algorithm is introduced into the improved method: BIRCH builds a clustering feature tree on each node of the cluster, and K-means is then applied to achieve parallelism, so that BIRCH can run in parallel on Spark.

Experiments show that the improved clustering algorithm achieves some improvement in performance. Applied to the analysis of mobile GPS data, it gives good results for dividing areas by population density and for identifying the peak times of each area. The optimized algorithm greatly improves the efficiency of analysis and computation in the application system.
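To make the parallel step in (3) more concrete, the following is a minimal sketch, not the thesis implementation. It relies on the standard BIRCH clustering feature CF = (N, LS, SS), which is additive, so summaries built independently can be merged. Several things here are assumptions made for illustration: plain Python lists stand in for Spark partitions (on Spark the per-partition step would roughly correspond to `RDD.mapPartitions`), a flat list of CFs replaces the CF tree, and the function names and the `threshold` parameter are hypothetical.

```python
import math
import random

# A clustering feature (CF) is (n, ls, ss): point count, per-dimension
# linear sum, and sum of squared norms. CFs are additive.

def cf_of(point):
    return (1, list(point), sum(x * x for x in point))

def cf_merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def cf_centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def cf_radius(cf):
    # radius^2 = SS/n - ||LS/n||^2, clamped at 0 against rounding error
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def summarize_partition(points, threshold):
    """Assumed simplification: a flat CF list instead of a CF tree.
    A point is absorbed by the nearest CF if the merged radius stays
    within the threshold; otherwise it starts a new CF."""
    cfs = []
    for p in points:
        new = cf_of(p)
        if cfs:
            i = min(range(len(cfs)),
                    key=lambda j: sq_dist(cf_centroid(cfs[j]), p))
            merged = cf_merge(cfs[i], new)
            if cf_radius(merged) <= threshold:
                cfs[i] = merged
                continue
        cfs.append(new)
    return cfs

def weighted_kmeans(cfs, k, iters=20, seed=0):
    """K-means over CF centroids; merging CFs weights each by its count."""
    rng = random.Random(seed)
    centers = [cf_centroid(cf) for cf in rng.sample(cfs, k)]
    for _ in range(iters):
        groups = [None] * k
        for cf in cfs:
            c = cf_centroid(cf)
            i = min(range(k), key=lambda j: sq_dist(centers[j], c))
            groups[i] = cf if groups[i] is None else cf_merge(groups[i], cf)
        # An empty group keeps its previous center.
        centers = [cf_centroid(g) if g is not None else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def cluster(partitions, k, threshold):
    # Local step: one CF summary per partition (independent, so parallelisable).
    local = [summarize_partition(p, threshold) for p in partitions]
    # Global step: cluster the collected summaries instead of the raw points.
    all_cfs = [cf for part in local for cf in part]
    return weighted_kmeans(all_cfs, k)

if __name__ == "__main__":
    parts = [[(0, 0), (10, 10), (0, 1), (10, 11)],
             [(1, 0), (11, 10), (1, 1), (11, 11)]]
    print(sorted(cluster(parts, k=2, threshold=1.5)))
```

The design point the sketch illustrates is that only the compact CF summaries, not the raw points, need to be gathered for the global K-means pass, which is what makes the local step cheap to distribute.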
Keywords/Search Tags: Clustering, Big Data, BIRCH, Spark, Parallelization