Research And Design Of Clustering Ensemble System Based On Spark

Posted on:2016-12-03

Degree:Master

Type:Thesis

Country:China

Candidate:T Wang

Full Text:PDF

GTID:2308330461470444

Subject:Software engineering

Abstract/Summary:

With the development of cloud computing, big data applications continue to expand in scope. The value of big data is receiving increasing attention, and the demand for real-time, effective data processing is rising with it. Applying cluster analysis to big data makes it easier to extract information, knowledge, and decision support from the data. Traditional clustering algorithms, however, are not fast enough when processing massive or high-dimensional data, and the performance of a single clustering algorithm is unsatisfactory on new data. On large-scale data they cannot run efficiently because of limited memory size, so traditional clustering algorithms struggle to meet the needs of current practical applications.

Clustering ensembles have been shown to greatly improve the performance of traditional clustering algorithms. MapReduce, a parallel computing model, lets many users analyze large data sets on clusters. However, the MapReduce model is not a panacea. To improve the accuracy of clustering results, especially when dealing with a very large number of records, this thesis proposes a Distributed Clustering Ensemble algorithm (DisCE) based on Resilient Distributed Datasets (RDDs). It takes full advantage of the RDD model and the clustering ensemble algorithm, and effectively improves the clustering performance and processing capability of big data applications. In this algorithm, an RDD-based Distributed Adjacency List is first designed to store and retrieve the data of the co-association matrix in a distributed environment. Next, using a distributed consensus function model, the several clustering results obtained from the big data are combined and represented in the form of the Distributed Adjacency List. Lastly, the Distributed Adjacency List is partitioned into the final clustering result by an optimized Affinity Propagation algorithm.

Spark is a new-generation distributed big data processing framework that followed Hadoop. This thesis designs and implements a clustering ensemble system based on Spark, which provides mass data storage, processing, and interoperability, and offers high reliability for big data applications. The system architecture follows hierarchical and component-oriented design principles. From bottom to top, the layers are: the distributed computing layer, the basic platform layer, the algorithm analysis layer, the cloud service layer, and the user client layer. The implementation makes full use of currently popular software frameworks, which shortens the development cycle and improves the quality of the system.

Finally, in system testing, this thesis carries out an accuracy test of the system's core algorithms and a system performance test. Analysis of the test results demonstrates the effectiveness and practicality of the work.
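
The co-association step described in the abstract can be illustrated with a small single-machine sketch. This is not the thesis's DisCE implementation: it is plain Python without Spark, the function name `co_association_adjacency` and the sample labels are hypothetical, and the Affinity Propagation step is omitted. It only shows how several base clusterings might be merged into a sparse adjacency list whose weights are the fraction of runs that placed two points in the same cluster. In an RDD setting, the pair-emitting loop would roughly correspond to a `flatMap` and the counting to a `reduceByKey`, but that mapping is an assumption rather than something the abstract states.

```python
from collections import defaultdict
from itertools import combinations


def co_association_adjacency(base_clusterings):
    """Build a sparse co-association structure as an adjacency list.

    base_clusterings[r][i] is the label that run r assigned to point i.
    Returns {i: {j: fraction of runs that put i and j in one cluster}}.
    Pairs that never co-occur are simply absent (sparse storage).
    """
    n_runs = len(base_clusterings)
    pair_counts = defaultdict(int)

    for labels in base_clusterings:
        # Group point indices by cluster label for this run.
        members = defaultdict(list)
        for point, label in enumerate(labels):
            members[label].append(point)
        # Count every co-clustered pair once per run.
        for points in members.values():
            for i, j in combinations(points, 2):
                pair_counts[(i, j)] += 1

    # Normalise counts and store both directions of each edge.
    adjacency = defaultdict(dict)
    for (i, j), count in pair_counts.items():
        weight = count / n_runs
        adjacency[i][j] = weight
        adjacency[j][i] = weight
    return dict(adjacency)


# Hypothetical example: three base clusterings of four points.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
adj = co_association_adjacency(runs)
```

Storing only the non-zero pairs is what makes an adjacency list attractive here: a dense co-association matrix grows quadratically with the number of points, while the list grows only with the number of pairs that were co-clustered at least once.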

Keywords/Search Tags:

Clustering Analysis, Clustering Ensemble, Distributed Computation, Spark, Resilient Distributed Datasets (RDDs)

Related items

1	Analysis Of The Clustering Algorithm On Data Stream Using Resilient Distributed Datasets
2	Research On Multi-View Subspace Clustering Ensemble And Its Distributed Implementation
3	The Design And Implementation Of Log Analysis System Based On Spark
4	Research On Distributed Clustering Algorithm Based On Spark And Implementation On Social Media Analysis
5	Multi-Kernel Spectral Clustering Based On Incomplete Multiple Views And Its Distributed Implementation
6	Research On The Effectiveness Element Theory And Method Of Clustering Ensemble
7	Fractal Analysis Of Datasets Using Distributed Computing
8	New Methods For Cluster Analysis In Distributed Environments
9	Research On Clustering Algorithm Based On Distributed Platform
10	Research On Spark Caching Strategy Based On Task Structure Optimization