The Research Of Distributed Parallel Spectral Clustering Algorithm Based On Data Stream

Posted on:2017-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:M Z Cheng

Full Text:PDF

GTID:2348330482486927

Subject:Computer software and theory

Abstract/Summary:

In many real-world activities, such as credit card fraud detection, stock and securities trading (financial analysis), network intrusion detection, and social network analysis, data arrives as an unbounded, real-time, dynamic stream. Because data streams are infinite, instantaneous, ordered, and massive in scale, clustering algorithms designed for static data cannot meet the requirements of data stream processing. This thesis studies the problem from three aspects.

1. Following the main framework of the CluStream clustering algorithm, the Online Offline Spectral Clustering Algorithm (OOSCA) is divided into a two-layer architecture: an online layer that maintains summary structures of the data, and an offline layer that performs accurate clustering. Because data streams are large-scale and high-dimensional, kernel principal component analysis (KPCA) is used for dimensionality reduction. Because the landmark window model cannot handle the sliding window data model, and maintaining a sliding window requires storing a large amount of information, which increases the storage load, this thesis proposes a KPCA-based online clustering method for data streams with time decay.

2. The offline layer uses spectral clustering, which is based on graph theory: the optimal solution of a graph partitioning problem replaces direct clustering of large data sets. The method can be applied to sample sets of any form found in the real world, and it closely approximates the optimal solution. First, a New Intuitionistic Fuzzy (NIF) similarity measure is used to build the similarity matrix. To improve the effectiveness and accuracy of clustering, an improved t-nearest neighbor method is used to sparsify the similarity matrix and to adjust the result for outliers. The ?-nearest neighbor rough set model is then used to compute the initial cluster centers for k-means, which performs the data clustering.

3. Data clustering involves large-scale, complex computation, so the time complexity of the algorithm is relatively high. Building the similarity matrix, solving for the first k eigenvectors of the Laplacian matrix, and computing the initial k-means cluster centers are not closely interdependent, so this thesis uses the distributed storage and parallel computing capabilities of Hadoop MapReduce to parallelize these three stages, reducing the time spent on the heavy computation in clustering.

Experimental results show that the improved distributed parallel spectral clustering algorithm for data streams performs well in clustering quality, accuracy, and reduction of computation. Finally, the work of this thesis is summarized, and directions for further exploration are outlined.
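
The offline stage described in the abstract (similarity matrix, t-nearest neighbor sparsification, first k eigenvectors of the Laplacian matrix, k-means) can be illustrated with a minimal single-machine sketch. This is not the thesis's implementation, and several pieces are assumptions made for illustration: a Gaussian kernel stands in for the NIF similarity measure, a plain t-nearest neighbor rule stands in for the improved one, a farthest-point rule stands in for the rough set initialization of the k-means centers, and all function names and parameters (`sigma`, `t`, `n_iter`) are hypothetical.

```python
import numpy as np


def gaussian_similarity(X, sigma=1.0):
    """Dense similarity matrix (assumed Gaussian kernel, not the NIF measure)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def sparsify_t_nearest(W, t):
    """Keep each row's t largest similarities; symmetrize by union."""
    n = W.shape[0]
    idx = np.argsort(-W, axis=1)[:, :t]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    mask = mask | mask.T
    return np.where(mask, W, 0.0)


def spectral_embedding(W, k):
    """Rows of the first k eigenvectors of L = I - D^{-1/2} W D^{-1/2}."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    U = vecs[:, :k]
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, 1e-12)


def kmeans(U, k, n_iter=100):
    """Lloyd's k-means with farthest-point initialization (an assumption)."""
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            U[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_cluster(X, k, t=10, sigma=1.0):
    """Similarity -> sparsify -> embed -> k-means."""
    W = sparsify_t_nearest(gaussian_similarity(X, sigma), t)
    return kmeans(spectral_embedding(W, k), k)
```

Calling `spectral_cluster(X, k=2)` on an array `X` of shape `(n_samples, n_features)` returns one integer label per row. Each row of the similarity matrix depends only on the input data, so row blocks can be computed independently of one another, which is the kind of structure that a MapReduce-style parallelization can exploit.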

Keywords/Search Tags:

spectral clustering, data stream, parallel computing, eigenvector, k-means, similarity matrix

Related items

1	Parallel Clustering Algorithm Based On MapReduce
2	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
3	Research On Spectral Clustering Of Large Scale Complex Data
4	Spectral Clustering Based On The Graph Theory Algorithms Research And Implementation
5	Parallel K-means Clustering Method And Its Resume Data Applied Research
6	The Research On Parallel Computing Technology In Precise Agricultural Climate Division
7	Research On Processing Methods Of Data Stream Based On Parallel Computing
8	Research And Application Of Spectral Clustering
9	Research On K-MEANS Algorithm Based On GPU Parallel And Its Application In Text Clustering
10	Data Stream Processing Algorithm Based On Cluster Analysis