The Research Of Distributed Parallel Spectral Clustering Algorithm Based On Data Stream

Posted on: 2017-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: M Z Cheng
Full Text: PDF
GTID: 2348330482486927
Subject: Computer software and theory
Abstract/Summary:
In real-world applications such as credit card fraud detection, stock and securities trading (financial analysis), network intrusion detection, and social network analysis, data arrives as an unbounded, real-time, dynamic stream. Because data streams are infinite, instantaneous, ordered, and massive in scale, clustering algorithms designed for static data cannot meet the requirements of data stream processing. This thesis addresses the problem from three aspects:

1. Following the main framework of the CluStream clustering algorithm, the Online Offline Spectral Clustering Algorithm (OOSCA) is divided into a two-layer architecture: an online layer that maintains summary structure information about the data, and an offline layer that performs accurate clustering. Because data streams are large-scale and high-dimensional, kernel principal component analysis (KPCA) is used for dimensionality reduction. Because the landmark window model cannot handle the cases covered by the sliding window model, and the sliding window model must maintain a large amount of information, which increases the storage load, this thesis proposes a KPCA-based time-decay online clustering method for data streams.

2. The offline layer uses spectral clustering, which is based on graph theory: clustering a large data set is replaced by finding the optimal solution of a graph partition problem. The method can be applied to sample sets of any form and approaches the optimal solution closely. First, the New Intuitionistic Fuzzy (NIF) similarity measure is used to construct the similarity matrix. To improve the effectiveness and accuracy of clustering, an improved t-nearest neighbor method is used to sparsify the similarity matrix and to tune the result for outliers. The ?-nearest neighbor rough set model is then used to compute the initial cluster centers for k-means, which performs the data clustering.

3. Data clustering involves large-scale, complex computation, so the time complexity of the algorithm is relatively high. Building the similarity matrix, solving for the first k eigenvectors of the Laplacian matrix, and computing the initial k-means cluster centers are not closely interdependent, so this thesis uses the distributed storage and parallel computing capabilities of Hadoop MapReduce to parallelize these three stages, reducing the time spent on the computation-heavy parts of clustering.

Experimental results show that the improved distributed parallel spectral clustering algorithm for data streams performs well in clustering quality, accuracy, and reduction of computation. Finally, the work of this thesis is summarized and directions for further exploration are outlined.
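The offline stages named in the abstract (similarity matrix, t-nearest neighbor sparsification, first k eigenvectors of the Laplacian, k-means) can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation, and several parts are assumptions made for illustration: a Gaussian kernel stands in for the NIF similarity measure, a plain t-nearest neighbor filter stands in for the improved variant, a farthest-point rule stands in for the rough-set-based choice of initial centers, and all function names and the parameters `sigma` and `t` are hypothetical. Only NumPy is used.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    # Assumption: Gaussian kernel as a stand-in for the NIF similarity measure.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def sparsify_t_nearest(W, t):
    # Keep each row's t largest similarities, then symmetrize the mask.
    n = W.shape[0]
    idx = np.argsort(-W, axis=1)[:, :t]
    keep = np.zeros((n, n), dtype=bool)
    keep[np.arange(n)[:, None], idx] = True
    keep = keep | keep.T
    return np.where(keep, W, 0.0)

def spectral_embedding(W, k):
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}; take the k
    # eigenvectors with the smallest eigenvalues and normalize each row.
    n = W.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    U = vecs[:, :k]
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

def farthest_point_init(U, k):
    # Assumption: deterministic stand-in for the rough-set-based initial centers.
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d))])
    return np.array(centers)

def kmeans(U, k, n_iter=50):
    # Lloyd's algorithm on the embedded rows.
    centers = farthest_point_init(U, k)
    labels = np.zeros(len(U), dtype=int)
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def spectral_cluster(X, k, t, sigma=1.0):
    W = sparsify_t_nearest(gaussian_similarity(X, sigma), t)
    return kmeans(spectral_embedding(W, k), k)
```

Calling `spectral_cluster(X, k=2, t=10)` on an `(n, d)` array returns one integer label per row. In this sketch each row of the similarity matrix depends only on the input points, which is the kind of independence that lets such a stage be split across parallel tasks; how the thesis partitions the work under MapReduce is not specified in the abstract.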
Keywords/Search Tags: spectral clustering, data stream, parallel computing, eigenvector, k-means, similarity matrix