Density-based Clustering Algorithm On Streaming Data

Posted on: 2022-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Zhou
GTID: 2518306557969239
Subject: Signal and Information Processing
Abstract/Summary:
The emergence of the Internet of Things (IoT) has led to the production of huge volumes of real-world data, and streaming data has become a new form of data. How to mine the information contained in data streams has gradually become a research hotspot. Data stream clustering, which processes streams quickly in limited memory to obtain high-quality clusters, is an important direction of data stream mining, but it faces many challenges. Density-based data stream clustering methods can find clusters of arbitrary shape, but their parameters are difficult to set and their clustering accuracy is low under concept drift. In addition, existing data stream clustering methods cannot cope well with the massive, diverse data and the real-time clustering requirements of multi-source heterogeneous data streams. The density peak clustering algorithm is a density-based clustering method proposed in recent years. It can find clusters of arbitrary shape and clusters well, but in its original form it is suited to static data rather than streaming data. This work improves the density peak algorithm to address these limitations. The main research work covers three aspects.

First, because the density peak clustering algorithm is sensitive to the choice of cutoff distance and requires manual inspection of the decision graph, an improved density peak clustering algorithm (Auto Density Peak Clustering, AutoDPC) is proposed. AutoDPC uses the Jaccard similarity coefficient as the basis for judging the density of data points and defines new density calculation rules. In addition, a heuristic search strategy is introduced to select cluster centers automatically, which avoids the error caused by subjectively reading the decision graph. Experimental results show that AutoDPC is not sensitive to model selection factors such as the cutoff distance, automatically obtains the correct number of clusters, and improves clustering quality without significantly reducing time efficiency.

Second, to address the poor clustering quality caused by concept drift and memory limits in complex data stream environments, a density peak clustering algorithm based on exponential bucket storage optimization is proposed. Building on AutoDPC, the algorithm extracts data summaries that evolve under concept drift and dynamically maintains the stream summary with a sequence of buckets that collect data at exponentially spaced time points. This greatly reduces the memory footprint of the data stream while keeping the time window span unchanged. In other words, recent data is stored at fine granularity and older data at coarse granularity, consistent with the idea that recent data deserves more attention under concept drift. Comparative experiments against the DenStream algorithm on synthetic and real data sets show that the algorithm occupies less storage space, supports multi-granularity clustering queries, and achieves better clustering quality on data streams with concept drift.

Third, to cluster data streams efficiently in multi-source scenarios, a distributed density peak clustering algorithm is proposed and deployed on the Apache Storm platform. Following the idea of edge computing, the algorithm splits the clustering task into an edge node part and a central node part, and pushes as much of the clustering work as possible to the edge nodes to relieve the pressure of centralized processing of multi-source data streams. Edge nodes use AutoDPC to cluster the data stream into local microclusters and incrementally update the local microcluster information. The central node aggregates the local microclusters from the edge nodes, synthesizes the global microclusters, and feeds the clustering results back to the edge nodes to update their local microclusters. Experiments show that the algorithm has good clustering efficiency and scalability: as the number of parallel threads increases, the clustering speed increases linearly while the clustering accuracy remains stable.
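
For orientation, the baseline density peak clustering that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard formulation only (local density counted within a cutoff distance, and each point's distance to its nearest denser neighbor); it is not AutoDPC, whose Jaccard-based density rules and heuristic center search are not specified in the abstract. The names `density_peaks` and `pick_centers` are hypothetical, and ranking points by the product of density and distance is an assumed stand-in for reading the decision graph by eye.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) per point using the baseline DPC definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # delta: distance to the nearest point with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbor gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Assumed rule (not from the thesis): take the k largest rho * delta."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i],
                   reverse=True)
    return sorted(order[:k])


# Two small square clusters, each with one point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta = density_peaks(points, d_c=0.8)
print(pick_centers(rho, delta, k=2))  # indices of the two middle points
```

In this sketch both `d_c` and `k` have to be supplied by hand, which is the sensitivity to the cutoff distance and the manual choice of centers that the abstract says AutoDPC is designed to remove.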
Keywords/Search Tags:Data stream, Concept drift, Distributed computing, Density peak clustering algorithm, Cutoff distance, Decision graph