The rapid development of information technology has made it possible to interconnect all things. At the same time, it has led to exponential growth in data. Unlike traditional static data, current data is real-time, massive, and volatile; we call it streaming data. Conventional clustering algorithms cannot meet the demands of processing such data, so research on streaming data clustering algorithms has become essential. With the emergence of distributed processing frameworks, processing massive, real-time streaming data is becoming increasingly efficient. Meanwhile, different clustering algorithms produce different results, so choosing the most appropriate algorithm for a specific data set is also an active research topic. To address these problems, this thesis carries out the following work:

(1) The streaming data clustering algorithms are analyzed, including their fundamental principles, advantages, and disadvantages, and the traditional clustering algorithms needed for their internal implementation are introduced. The evaluation indexes of clustering validity and the distributed computing platform are analyzed. The fuzzy multi-criteria decision-making method is also introduced, including fuzzy sets, aggregation operators, and weighting methods.

(2) To address the problems of the CluStream algorithm, the DD-CluStream algorithm is proposed. The online part adopts a two-stage clustering mode, divided into remote-node clustering and central-node clustering, and the density-based DBSCAN algorithm replaces the k-means algorithm. A sliding window and an attenuation function are introduced to eliminate expired data, assign weights to micro-clusters, and reduce the influence of old clusters on newly arrived data. At the end of each window, the central node makes adaptive adjustments, deleting expired micro-clusters and outlier data to improve the clustering effect. In the offline macro-clustering layer, the density-peak-based DPCA algorithm replaces the k-means algorithm, which reduces the instability of the clustering results and improves their accuracy.

(3) The DD-CluStream algorithm is deployed on the Storm platform for parallel processing and compared with three other streaming data clustering algorithms. The clustering execution time on the distributed platform, the processing pressure under different numbers of threads, and the values of various clustering evaluation indexes verify the effectiveness of the clustering algorithm on the Storm platform and the advantages of a distributed platform for streaming data processing.

(4) An algorithm evaluation and optimization model is established. The evaluation matrix composed of cluster validity evaluation indexes is represented by Pythagorean fuzzy sets. The evaluation indexes are weighted by a comprehensive weight that combines the objective weight from the maximum deviation method with the subjective weight assigned according to experts' prior knowledge. The evaluation values are then aggregated by a decision method based on the Pythagorean fuzzy weighted MSM aggregation operator, and the final comprehensive evaluation value is obtained from the score function and the accuracy function, in order to verify the effectiveness of the DD-CluStream algorithm and to select the best clustering algorithm for a specific data set.
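
The final ranking step in (4) can be illustrated with a minimal sketch. It assumes the commonly used definitions for a Pythagorean fuzzy number with membership degree mu and non-membership degree nu: score s = mu^2 - nu^2 and accuracy h = mu^2 + nu^2, with a higher score preferred and accuracy used to break ties. The abstract does not state the exact formulas used in the thesis, so these definitions, the class name PFN, the function names, and the sample values are all assumptions made for illustration; they are not results from the thesis.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PFN:
    """Pythagorean fuzzy number (hypothetical helper type for this sketch)."""

    mu: float  # membership degree
    nu: float  # non-membership degree

    def __post_init__(self) -> None:
        # Both degrees must lie in [0, 1] and satisfy mu^2 + nu^2 <= 1.
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu ** 2 + self.nu ** 2 > 1.0 + 1e-12:
            raise ValueError("mu^2 + nu^2 must not exceed 1")


def score(p: PFN) -> float:
    """Assumed score function: s = mu^2 - nu^2."""
    return p.mu ** 2 - p.nu ** 2


def accuracy(p: PFN) -> float:
    """Assumed accuracy function: h = mu^2 + nu^2."""
    return p.mu ** 2 + p.nu ** 2


def rank(candidates: dict[str, PFN]) -> list[str]:
    """Order candidates best first: by score, then by accuracy on ties."""
    return sorted(
        candidates,
        key=lambda name: (score(candidates[name]), accuracy(candidates[name])),
        reverse=True,
    )


# Illustrative aggregated evaluation values only, not data from the thesis.
aggregated = {
    "algo_a": PFN(0.5, 0.5),
    "algo_b": PFN(0.3, 0.3),
    "algo_c": PFN(0.8, 0.1),
}
ranking = rank(aggregated)
```

Here `algo_a` and `algo_b` have the same score, so the accuracy function decides their relative order. In the thesis's model, the inputs to this step would be the values produced by the weighted MSM aggregation rather than hand-picked numbers.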