Spectral Clustering Algorithm Based On Spark And The Application On QAR Data

Posted on:2018-01-27

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Wu

GTID:2348330533960093

Subject:Computer Science and Technology

Abstract/Summary:

Spectral clustering algorithms have been shown to find clusters more effectively than some traditional algorithms because they convert the clustering problem into a graph-partitioning optimization problem. The idea is that each vertex in a graph represents a data point, and each edge between two vertices is weighted by the similarity between the corresponding data points. For this undirected weighted graph, we use the Laplacian matrix to find a partition such that the edges between different subgraphs have very low weights and the edges within a subgraph have high weights. However, spectral clustering suffers from scalability problems in both memory use and computation time when the data set is large. To cluster large data sets, this thesis proposes a parallel spectral clustering algorithm for large-scale matrices, implemented on the Spark and Hadoop platforms. The main work of this thesis is as follows:

1. To address the inability of spectral clustering to handle massive data, we designed a parallel spectral clustering algorithm based on Spark. We compute the upper triangular similarity matrix with GraphX, which provides graph-parallel computation, and then compute the Laplacian matrix from that upper triangular matrix. The Lanczos algorithm, which transforms a symmetric matrix into a symmetric tridiagonal matrix by an orthogonal similarity transformation, is one of the most effective methods for solving large-scale eigenvalue problems. This thesis introduces a Lanczos method on the Spark platform that improves the time efficiency of the clustering algorithm. A parallel k-means algorithm is then applied to the intermediate results to obtain the final clusters.

2. Building a QAR data warehouse based on Hive. We first studied the architecture of HDFS and designed a web-based HDFS visualization system. On this basis, we designed the architecture and storage of a Hive-based QAR data warehouse, to solve the problem that a relational database is not sufficient to support massive data storage and analysis.

3. Applying the parallel spectral clustering algorithm to QAR data. Experiments show that the Hive-based QAR data warehouse satisfies the requirements of data mining and that spectral clustering is effective.
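
The similarity, Laplacian, and Lanczos steps named in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis's Spark or GraphX implementation. The abstract does not say which similarity function or which Laplacian is used, so the Gaussian kernel, the `sigma` parameter, the unnormalized Laplacian L = D - W, and the full reorthogonalization inside the Lanczos loop are all assumptions. The function names (`similarity_upper`, `laplacian`, `lanczos`) are hypothetical.

```python
import math

def similarity_upper(points, sigma=1.0):
    """Gaussian similarity for pairs i < j only (the upper triangle).
    The kernel and sigma are assumptions, not taken from the thesis."""
    n = len(points)
    upper = {}
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            upper[(i, j)] = math.exp(-d2 / (2.0 * sigma ** 2))
    return upper

def laplacian(upper, n):
    """Unnormalized Laplacian L = D - W (assumed variant), built by
    mirroring the upper triangle into a full symmetric matrix W."""
    W = [[0.0] * n for _ in range(n)]
    for (i, j), w in upper.items():
        W[i][j] = w
        W[j][i] = w
    L = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(W[i])  # degree of vertex i (W has a zero diagonal)
    return L

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, v0, m, tol=1e-8):
    """Run up to m Lanczos steps on the symmetric matrix A.
    Returns (alphas, betas, Q): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the orthonormal Lanczos vectors."""
    norm = math.sqrt(dot(v0, v0))
    Q = [[x / norm for x in v0]]
    alphas, betas = [], []
    for k in range(m):
        w = matvec(A, Q[k])
        alphas.append(dot(w, Q[k]))
        # Full reorthogonalization: remove components along every
        # earlier Lanczos vector, not only the last two.
        for q in Q:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if k == m - 1 or beta < tol:  # last step, or invariant subspace found
            break
        betas.append(beta)
        Q.append([x / beta for x in w])
    return alphas, betas, Q

# Two small, well-separated groups of 2-D points (made-up data).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9)]
upper = similarity_upper(points, sigma=1.0)
L = laplacian(upper, len(points))
alphas, betas, Q = lanczos(L, [1.0, 2.0, 3.0, 4.0, 5.0], m=len(points))
print("diagonal of T:", alphas)
print("off-diagonal of T:", betas)
```

The eigenvalues of the small tridiagonal matrix T approximate those of L, which is why the reduction helps with large eigenvalue problems. The dominant cost in each step is the matrix-vector product, which is the part a distributed implementation would parallelize.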

Keywords/Search Tags:

Spark, Spectral clustering, QAR, Hive, Hadoop

Related items

1	Agricultural Product Price Analysis And Forecast System Design Based On Hadoop+Spark Platform
2	The Research And Application Of Security Log Clustering Mining Algorithm Based On Hadoop Platform
3	Design And Implementation Of Weibo Data Mining System Based On Hadoop Platform
4	Design And Implementation Of Massive Web Log Analysis System Based On Hadoop/Hive
5	Research Of Parallel Text Spectral Clustering Algorithm Based On Spark
6	Parallel Spectral Clustering Algorithm Based On Hadoop
7	Research On Spectral Clustering Algorithm Based On Hadoop Platform
8	Design And Implementation Of Advertising Business Data Management Platform
9	Rock Image Clustering Analysis Algorithm Research Based On Spark
10	The Design And Implementation Of Network Authentication System Based On Hadoop/hive