
Research On Real-time Online Evolving Clustering Algorithm For Streaming Data

Posted on: 2019-01-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Tian
GTID: 2428330596966422
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, the data generated in many emerging fields takes the form of streams, which carry a great deal of valuable information. Clustering and classification techniques are commonly used to mine this information. At present, most clustering algorithms target offline, static data: they first store all the data in memory and then obtain the final clustering result by traversing the data many times. Streaming data, however, arrives continuously in time order, and its size is not known in advance, so receiving and processing usually have to proceed at the same time. Because the data arrives in time order, a clustering algorithm for streaming data must be able to work online. Because the size of the stream is unknown, it is neither possible nor necessary to store all the data in limited memory. When a single data point arrives, the algorithm must process it and return the result in real time.

To improve the accuracy of clustering for streaming data, this paper proposes a DBI-based online evolving clustering algorithm for streaming data (DBIECM), built on the evolving clustering method (ECM). The DBIECM algorithm updates each cluster center with the mean of the cluster's samples, updates each cluster radius with the maximum distance from a sample in the cluster to its center, and introduces the Davies-Bouldin Index (DBI) as the criterion for deciding which cluster a data point is assigned to. Compared with the ECM algorithm, the DBIECM algorithm preserves the ability to cluster streaming data online and improves clustering performance.

To keep the real-time, online, one-pass characteristics of the ECM algorithm while improving clustering performance, this paper proposes a real-time online clustering algorithm for streaming data (SD-ECM), also based on the ECM algorithm. The SD-ECM algorithm designs a new update method for the cluster center and the cluster radius, and a triple feature vector that represents the features of a cluster. Using these feature vectors, the SD-ECM algorithm can process a single data point with less computation and without access to historical data. Compared with the ECM and DBIECM algorithms, the SD-ECM algorithm has better clustering performance.

The study found that the SD-ECM, DBIECM, and ECM algorithms are all very sensitive to the threshold parameter Dthr, whose value directly affects the final number of clusters and the quality of the clustering. Without prior knowledge, Dthr may be set unreasonably, which leads to too many clusters. To address this problem, this paper adds to the SD-ECM algorithm a threshold parameter MaxNum that limits the maximum number of clusters, and proposes a real-time online evolving clustering algorithm for streaming data based on optimized parameter thresholds (ECMStream). The ECMStream algorithm retains the one-pass characteristic and makes Dthr easier to set, because Dthr can be updated incrementally and adaptively. Compared with the other algorithms in this paper, the ECMStream algorithm has better clustering performance and higher efficiency for real-time online clustering of streaming data.

Since expired data in the clustering result may no longer have reference value, this paper also designs a processing method with low time and space complexity that updates the feature vector of a cluster while deleting expired data, which eliminates the impact of expired data on the current clustering process.
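As a rough illustration of the one-pass idea described above, the following Python sketch clusters points one at a time without storing them. It is not the thesis's SD-ECM or ECMStream algorithm: the abstract does not say what the triple feature vector contains or how Dthr is adapted, so the summary used here (point count, per-dimension sum, approximate radius), the rule of merging the two closest clusters when MaxNum is exceeded, and the rule of raising Dthr to the merged distance are all assumptions made for this example. The names `Cluster`, `OnePassClusterer`, `dthr`, and `max_num` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    # Assumed three-part summary; the thesis's actual triple is not given.
    n: int             # number of points absorbed so far
    linear_sum: list   # per-dimension sum of the absorbed points
    radius: float      # approximate radius, never recomputed from history

    @property
    def center(self):
        # Mean of the absorbed points, derived from the summary alone.
        return [s / self.n for s in self.linear_sum]


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OnePassClusterer:
    def __init__(self, dthr, max_num):
        self.dthr = dthr        # distance threshold for joining a cluster
        self.max_num = max_num  # upper bound on the number of clusters (>= 1)
        self.clusters = []

    def _nearest(self, x):
        return min(range(len(self.clusters)),
                   key=lambda i: dist(x, self.clusters[i].center))

    def _merge_closest(self):
        # Assumed policy: merge the two clusters with the closest centers
        # and raise dthr to at least the distance that was just merged.
        pairs = [(dist(a.center, b.center), i, j)
                 for i, a in enumerate(self.clusters)
                 for j, b in enumerate(self.clusters) if i < j]
        d, i, j = min(pairs)
        a, b = self.clusters[i], self.clusters[j]
        merged = Cluster(a.n + b.n,
                         [p + q for p, q in zip(a.linear_sum, b.linear_sum)],
                         0.0)
        # Upper bound on the merged radius, computed from summaries only.
        merged.radius = max(dist(merged.center, a.center) + a.radius,
                            dist(merged.center, b.center) + b.radius)
        self.clusters[i] = merged
        del self.clusters[j]
        self.dthr = max(self.dthr, d)

    def update(self, x):
        """Process one point and return the index of its cluster."""
        if self.clusters:
            j = self._nearest(x)
            c = self.clusters[j]
            if dist(x, c.center) <= self.dthr:
                # Absorb the point: update the summary, discard the point.
                c.n += 1
                c.linear_sum = [s + v for s, v in zip(c.linear_sum, x)]
                c.radius = max(c.radius, dist(x, c.center))
                return j
        # Too far from every existing cluster: start a new one.
        self.clusters.append(Cluster(1, list(x), 0.0))
        if len(self.clusters) > self.max_num:
            self._merge_closest()
            return self._nearest(x)
        return len(self.clusters) - 1
```

Each call to `update` touches only the cluster summaries, so memory use depends on the number of clusters rather than on the length of the stream. Because the summary is additive, an expired point could in principle be removed by subtracting it from `n` and `linear_sum`, although the radius would then only be an upper bound; whether the thesis handles expiry this way is not stated in the abstract.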
Keywords/Search Tags: streaming data, online clustering, evolving clustering, real-time clustering, one-pass algorithm