
Research On Parallel Clustering Algorithm For Streaming Data

Posted on: 2016-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Xu
Full Text: PDF
GTID: 2208330464463536
Subject: Computer Science and Technology
Abstract/Summary:
With the spread of computers, mobile devices, and the Internet of Things, web applications of all kinds have become ubiquitous. As a result, massive multi-source, heterogeneous stream data is being generated, such as network intrusion data, stock data, and weather monitoring data. Because stream data is diverse, temporal, massive, continuously arriving, and potentially unbounded, mining information from it is increasingly complicated and challenging. As an unsupervised learning method, cluster analysis partitions data into groups based on the similarity among data points. It is therefore an important data mining method that helps extract latent, previously unknown, and valuable information from massive data.

On the one hand, stream data mining cannot store all data persistently, because memory is limited and disk I/O is inefficient. On the other hand, conventional stream data processing methods make real-time, online data mining impractical. How to process stream data in a real-time, efficient, and reliable way has thus become a notably difficult problem. In recent years, the rapid rise and application of parallel and distributed computing, cluster architectures, and related technologies have opened the way to real-time mining of large-scale stream data. In this thesis, parallelization is introduced into stream data cluster analysis, and the clustering algorithms are parallelized on Spark, a distributed in-memory computing framework. Data can thus be distributed across machines for parallel processing, which provides real-time response, high throughput, and high fault tolerance. The details are as follows:

(1) Based on the characteristics of stream data, we study stream data clustering algorithms in detail and divide them into categories. We then analyze the working principles and implementation mechanisms of the MapReduce programming model on Hadoop, a distributed computing framework. By comparing MapReduce with the Spark model for parallel processing of stream data, we summarize the advantages of Spark.

(2) To address timeliness and parameter sensitivity, we review the CluStream algorithm and propose a stream data clustering algorithm named CluWin-GA, which combines a variable-length sliding window with a genetic algorithm. Analysis of the experimental results shows that the algorithm achieves better timeliness and reliability, and that it is a novel algorithm that adapts dynamically when clustering stream data.

(3) We first introduce in-memory computing and parallel computing into the stream data clustering procedure, and then propose an improved parallel strategy for the clustering algorithm. On the Spark platform, parallel versions of the CluStream algorithm and of the improved CluWin-GA algorithm were implemented separately. Experimental results show that both parallel clustering algorithms can cluster stream data in real time, efficiently, and reliably.

In conclusion, targeting the features of stream data, we first propose a clustering algorithm for stream data with a two-layer architecture, based on the sliding window technique and a genetic algorithm. Second, by introducing in-memory computing and parallelism into the clustering algorithm, we implement parallel clustering algorithms for stream data in the Spark framework. This work lays a foundation for further research and has both theoretical and practical significance for parallel clustering of stream data in big data and cloud computing environments.
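The abstract does not give implementation details, so the following is only a minimal sketch of one idea that makes CluStream-style clustering amenable to parallel processing: its micro-cluster summaries (a count plus linear and squared sums of points and timestamps) are additive, so each data partition can be summarized independently and the partial summaries merged afterwards. The names `MicroCluster` and `summarize` are hypothetical and written in plain Python; this is not the thesis's CluWin-GA algorithm or its Spark implementation.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, List, Sequence, Tuple


@dataclass
class MicroCluster:
    """CluStream-style summary of a group of timestamped points."""
    n: int            # number of points absorbed
    ls: List[float]   # per-dimension linear sum of the points
    ss: List[float]   # per-dimension sum of squares of the points
    lt: float         # sum of timestamps
    st: float         # sum of squared timestamps

    @classmethod
    def from_point(cls, x: Sequence[float], t: float) -> "MicroCluster":
        return cls(1, list(x), [v * v for v in x], t, t * t)

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Additivity: the summary of a union is the component-wise sum.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
            self.lt + other.lt,
            self.st + other.st,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]


def summarize(partition: Iterable[Tuple[Sequence[float], float]]) -> MicroCluster:
    """Fold one non-empty partition of (point, timestamp) pairs into a summary."""
    parts = [MicroCluster.from_point(x, t) for x, t in partition]
    return reduce(MicroCluster.merge, parts)


# Two hypothetical partitions of a small stream, summarized independently.
partitions = [
    [((1.0, 2.0), 1.0), ((3.0, 4.0), 2.0)],
    [((5.0, 6.0), 3.0)],
]
partials = [summarize(p) for p in partitions]      # per-partition step
merged = reduce(MicroCluster.merge, partials)      # combine step
print(merged.n, merged.centroid())
```

Because `merge` is associative and commutative, the per-partition step and the combine step follow the same shape as a per-partition map followed by a reduce in a framework such as Spark. How the thesis actually distributes the work is not stated in the abstract.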
Keywords/Search Tags: Parallelization, Cluster, Stream Data, MapReduce, Spark