
Research On Stream Data Clustering Algorithm Based On Storm

Posted on: 2017-05-13
Degree: Master
Type: Thesis
Country: China
Candidate: K Ma
GTID: 2308330488997130
Subject: Computer software and theory
Abstract/Summary:
Cloud computing and Internet of Things (IoT) technologies are becoming increasingly mature. Messaging services and related technologies now generate terabytes or even petabytes of data every day, which marks the arrival of the big data era. Big data is characterized by large volume, high velocity, variety, and low value density, so how to process this kind of data is a problem that must be faced.

Based on the stream processing model in a big data environment, this thesis studies stream data clustering algorithms. It addresses not only clustering accuracy but also clustering efficiency, that is, the distribution and parallelization of stream data clustering algorithms. In addition, the thesis designs and implements a distributed, parallelized stream data clustering algorithm on the real-time computation system Storm.

To improve clustering accuracy, the thesis improves the classical stream data clustering algorithm CluStream. Compared with Euclidean distance, Mahalanobis distance takes the correlations among attributes into account and is not affected by the differing measurement scales of those attributes. Building on this observation, the thesis replaces Euclidean distance with Mahalanobis distance for stream data and designs a new stream data clustering algorithm, M-Clustering (Mahalanobis-Clustering). A comparative experiment between CluStream and M-Clustering in a simulated Storm environment showed that M-Clustering effectively improved clustering accuracy.

To address distribution and parallelism, the thesis focuses on the micro-clustering phase of CluStream and designs DPRCluStream, a Distributed Parallelized Real-time Clustering algorithm for Stream Data. DPRCluStream divides micro-clustering into two parts, local and global. The local part is processed with multithreading, and the global part merges the intermediate results.
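The local/global split can be illustrated with a minimal Python sketch. It is not the thesis's DPRCluStream implementation and does not use Storm. It assumes the additive cluster-feature summary (count, linear sum, squared sum) used by CluStream-style micro-clusters, and it swaps in a simple hypothetical absorption rule: a summary is merged into the nearest existing micro-cluster when their centroids lie within `max_dist`. The names `MicroCluster`, `merge_close`, `local_step`, `global_step`, and `cluster_partitions` are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import math


@dataclass
class MicroCluster:
    """Additive summary of a set of points: count, linear sum, squared sum."""
    n: int
    linear_sum: list
    square_sum: list

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), [v * v for v in p])

    def copy(self):
        return MicroCluster(self.n, list(self.linear_sum), list(self.square_sum))

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, other):
        # Summaries of disjoint point sets simply add up component-wise.
        self.n += other.n
        self.linear_sum = [a + b for a, b in zip(self.linear_sum, other.linear_sum)]
        self.square_sum = [a + b for a, b in zip(self.square_sum, other.square_sum)]


def merge_close(clusters, incoming, max_dist):
    """Fold each incoming summary into the nearest cluster within max_dist
    (hypothetical rule), otherwise keep it as a new micro-cluster."""
    for mc in incoming:
        best = min(
            clusters,
            key=lambda c: math.dist(c.centroid(), mc.centroid()),
            default=None,
        )
        if best is not None and math.dist(best.centroid(), mc.centroid()) <= max_dist:
            best.absorb(mc)
        else:
            clusters.append(mc.copy())
    return clusters


def local_step(points, max_dist):
    """Local part: summarize one partition of the stream into micro-clusters."""
    return merge_close([], [MicroCluster.from_point(p) for p in points], max_dist)


def global_step(local_results, max_dist):
    """Global part: merge the intermediate results of all local workers."""
    merged = []
    for result in local_results:
        merge_close(merged, result, max_dist)
    return merged


def cluster_partitions(partitions, max_dist, workers=2):
    # One thread per partition stands in for the multithreaded local nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_results = list(pool.map(lambda part: local_step(part, max_dist), partitions))
    return global_step(local_results, max_dist)
```

Because the summaries are additive, the global part never needs the raw points, only the per-partition micro-clusters. In CPython the threads mainly illustrate the structure; real parallel speedup would come from separate worker processes or nodes.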
Experiments on a Storm cluster showed that the accuracy of DPRCluStream was close to that of the static clustering algorithm k-means, and that efficiency increased nearly linearly with the number of local nodes while clustering accuracy remained stable.

The research is suited to the current big data environment and has both practical and theoretical value.
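The motivation for replacing Euclidean distance with Mahalanobis distance can be seen in a small Python sketch. It is an illustration under stated assumptions, not the M-Clustering algorithm itself: it is restricted to two attributes so the covariance matrix can be inverted in closed form, and the helper names `covariance_2d`, `mahalanobis_2d`, and `euclidean` are hypothetical.

```python
import math


def covariance_2d(points):
    """Sample covariance matrix ((sxx, sxy), (sxy, syy)) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (sxx, sxy), (sxy, syy)


def mahalanobis_2d(x, center, cov):
    """sqrt(d^T * inverse(cov) * d) with d = x - center, for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - center[0], x[1] - center[1]
    # inverse(cov) = (1 / det) * [[d, -b], [-c, a]]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def euclidean(x, center):
    return math.hypot(x[0] - center[0], x[1] - center[1])


# Rescaling one attribute (for example metres to millimetres) changes the
# Euclidean distance but leaves the Mahalanobis distance unchanged.
points = [(1, 2), (2, 1), (3, 5), (4, 3)]
scaled = [(1000 * px, py) for px, py in points]
d_original = mahalanobis_2d(points[0], points[3], covariance_2d(points))
d_scaled = mahalanobis_2d(scaled[0], scaled[3], covariance_2d(scaled))
```

With an identity covariance matrix the two distances coincide, so Euclidean distance is the special case of uncorrelated attributes with unit variance.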
Keywords/Search Tags: stream data, cluster, Mahalanobis Distance, distributed, Storm