Research And Application Of Big Data Clustering Algorithm Based On Spark Platform

Posted on:2019-09-30

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2428330566499347

Subject:Software engineering

Abstract/Summary:


With the rapid development of the information society, huge amounts of data are being produced, and much of this data is unordered and unclassified, which has sharply increased the demand for clustering of massive data. Against this background there are two problems. First, traditional clustering algorithms can no longer cope with the complexity of today's data, so existing algorithms urgently need to be improved or new ones proposed. Second, the hardware of a single machine has become a bottleneck for processing massive data. As a result, distributed cluster platforms have gradually replaced the traditional single server in big data processing. In particular, the emergence of Spark, a memory-based distributed computing framework, has made it possible to solve most massive-data processing problems. This thesis optimizes the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering algorithm and implements it in parallel on the Spark distributed computing framework. The main work of the thesis is as follows:

(1) The massive data are first collected and preprocessed. Specifically, the data are compressed to reduce their volume and ease the load of massive data processing. Before compression, the original data are backed up to guard against loss.

(2) Because BIRCH is sensitive to the order of its input data, the thesis first uses the K-means algorithm for a rough clustering pass. The number of clusters is kept small and the threshold parameter relatively large, so that K clusters are obtained quickly. The K clusters are then fully permuted, and all or some of the resulting orderings, arranged by size, are used as input to the next clustering step.

(3) BIRCH is improved so that it can run in parallel on the Spark platform. K-means is introduced into the improved method: BIRCH builds a clustering feature tree on each node of the cluster, and K-means is then applied so that the BIRCH algorithm can run in parallel on Spark.

Experiments show that the improved clustering algorithm achieves some improvement in performance. It is also demonstrated on the analysis of mobile GPS data, where it gives good results for dividing areas by population density and identifying their peak times. The optimized algorithm greatly improves the efficiency of analysis and computation in the application system.
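One possible reading of item (3) of the abstract can be sketched in plain Python. This is an illustration under stated assumptions, not the thesis's implementation: the names `CF`, `summarize_partition` and `weighted_kmeans` are hypothetical, a flat list of clustering features stands in for a real CF tree, and ordinary Python lists stand in for Spark partitions. What the sketch relies on is the standard BIRCH clustering feature (N, LS, SS), which is additive, so summaries built independently on each node can be merged afterwards.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CF:
    """BIRCH clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    ls: List[float]
    ss: float

    @classmethod
    def of_point(cls, p: Sequence[float]) -> "CF":
        return cls(1, list(p), sum(x * x for x in p))

    def merged(self, other: "CF") -> "CF":
        # CF vectors are additive, which is what allows per-node summaries.
        ls = [a + b for a, b in zip(self.ls, other.ls)]
        return CF(self.n + other.n, ls, self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # sqrt(SS/N - ||LS/N||^2), clamped at 0 against rounding error.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def summarize_partition(points, threshold: float) -> List[CF]:
    """Simplified, flat stand-in for a CF tree (assumption): absorb each point
    into the nearest entry if the merged radius stays within the threshold."""
    entries: List[CF] = []
    for p in points:
        cf = CF.of_point(p)
        if entries:
            i = min(range(len(entries)),
                    key=lambda j: dist2(entries[j].centroid(), p))
            candidate = entries[i].merged(cf)
            if candidate.radius() <= threshold:
                entries[i] = candidate
                continue
        entries.append(cf)
    return entries


def weighted_kmeans(entries: List[CF], k: int, iters: int = 20, seed: int = 0):
    """Lloyd-style K-means over CF entries; merging the CFs assigned to a
    center yields the count-weighted mean of their centroids."""
    rng = random.Random(seed)
    centers = [e.centroid() for e in rng.sample(entries, k)]
    for _ in range(iters):
        groups = [None] * k
        for e in entries:
            i = min(range(k), key=lambda j: dist2(centers[j], e.centroid()))
            groups[i] = e if groups[i] is None else groups[i].merged(e)
        # Keep the old center if no entry was assigned to it.
        centers = [g.centroid() if g is not None else c
                   for g, c in zip(groups, centers)]
    return centers


# Two toy "partitions" standing in for data held on two cluster nodes.
partitions = [
    [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)],
    [(0.2, 0.1), (9.9, 10.2)],
]
local = [cf for part in partitions for cf in summarize_partition(part, 0.5)]
print(sorted(weighted_kmeans(local, k=2)))
```

On Spark, the per-partition step would typically be expressed with something like `rdd.mapPartitions(...)` and the small set of resulting CF entries collected for the final K-means pass; whether the thesis does exactly this is not stated in the abstract.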

Keywords/Search Tags:

Clustering, Big Data, BIRCH, Spark, Parallelization


Related items

1	Research And Application Of Parallelization Optimization Of Spatial Clustering Algorithm Based On Spark
2	Research On K-medoids Clustering Algorithm Based On Spark
3	Research On Parallelization Of Data Stream Clustering Algorithm For Police Data
4	The Design And Implementation Of Parallelization Of Canopy And FCM Clustering Algorithms On Spark Platform
5	The Parallelization And Optimization Of K-means Algorithm Based On Spark
6	Research On Cluster Analysis Technology Of Component Size Measurement Data Based On Spark
7	Research And Implementation Of Classification Algorithm Parallelization Based On Spark
8	Research And Implementation Of Large-Scale And Efficient Clustering Algorithm Based On Spark
9	The Optimization Of Clustering And Classification Algorithms Based On SPARK
10	Implementation And Application Of Clustering Algorithm Based On Spark