Spectral Clustering Algorithm Based On Spark And The Application On QAR Data

Posted on: 2018-01-27
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Wu
GTID: 2348330533960093
Subject: Computer Science and Technology
Abstract/Summary:
Spectral clustering algorithms have been shown to find clusters more effectively than some traditional algorithms because they convert the clustering problem into a graph-partitioning optimization problem. The idea is that each vertex in a graph represents a data point, and each edge between two vertices is weighted by the similarity between the corresponding data points. For this undirected weighted graph, we use the Laplacian matrix to find a partition such that edges between different subgraphs have very low weights and edges within a subgraph have high weights. However, spectral clustering suffers from scalability problems in both memory use and computation time when the data set is large. To cluster large data sets, this thesis proposes a parallel spectral clustering algorithm for large-scale matrices, implemented on the Spark and Hadoop platforms. The main work of this thesis is as follows:

1. Because spectral clustering cannot handle massive data, we designed a parallel spectral clustering algorithm based on Spark. We compute the upper triangular similarity matrix with GraphX, which provides graph-parallel computation, and then compute the Laplacian matrix from it. The Lanczos algorithm transforms a symmetric matrix into a symmetric tridiagonal matrix by orthogonal similarity transformation and is one of the most effective methods for solving large-scale eigenvalue problems. This thesis introduces a Lanczos method on the Spark platform that improves the time efficiency of the clustering algorithm. A parallel k-means algorithm is then applied to the intermediate results to obtain the final clusters.
2. Building a QAR data warehouse based on Hive. We first studied the architecture of HDFS and designed an HDFS visualization system with a web interface. On this basis, we designed the architecture and storage of a Hive-based QAR data warehouse, addressing the problem that a relational database is not sufficient to support massive data storage and analysis.
3. Applying the parallel spectral clustering algorithm to QAR data. Experiments show that the Hive-based QAR data warehouse satisfies the requirements of data mining and that spectral clustering is effective.
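
The first stage described above (similarity matrix, Laplacian, Lanczos tridiagonalization) can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's Spark/GraphX implementation. The Gaussian similarity kernel, the unnormalized Laplacian L = D - W, the full reorthogonalization step, and all function names (`gaussian_similarity`, `laplacian`, `lanczos`) are assumptions made for illustration, since the abstract does not specify them.

```python
import math

def gaussian_similarity(points, sigma=1.0):
    """Similarity matrix W with W[i][j] = exp(-||xi - xj||^2 / (2 sigma^2)).
    The Gaussian kernel is an assumed choice; the diagonal is left at zero."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # compute the upper triangle, then mirror it
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

def laplacian(W):
    """Unnormalized graph Laplacian L = D - W (assumed variant)."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, v0, m):
    """Run up to m Lanczos steps on a symmetric matrix A.
    Returns (alphas, betas, Q): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the list of orthonormal Lanczos vectors."""
    norm = math.sqrt(dot(v0, v0))
    Q, alphas, betas = [[x / norm for x in v0]], [], []
    for k in range(m):
        w = matvec(A, Q[k])
        alphas.append(dot(w, Q[k]))
        # Full reorthogonalization against every earlier Lanczos vector,
        # which trades extra work for numerical stability.
        for q in Q:
            c = dot(w, q)
            w = [x - c * y for x, y in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if k == m - 1 or beta < 1e-12:  # last step, or invariant subspace found
            break
        betas.append(beta)
        Q.append([x / beta for x in w])
    return alphas, betas, Q

# Hypothetical toy data: two loose groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.2), (4.0, 4.0), (4.5, 5.5)]
L = laplacian(gaussian_similarity(points, sigma=2.0))
alphas, betas, Q = lanczos(L, [1.0, 2.0, 3.0, 4.0, 5.0], m=3)
```

The eigenvalues of the small tridiagonal matrix defined by `alphas` and `betas` approximate extreme eigenvalues of `L`. A full pipeline would go on to recover the corresponding eigenvectors and run k-means on their rows, which is the step the abstract assigns to parallel k-means.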
Keywords/Search Tags: Spark, Spectral clustering, QAR, Hive, Hadoop