Research And Design Of Clustering Ensemble System Based On Spark

Posted on: 2016-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
GTID: 2308330461470444
Subject: Software engineering
Abstract/Summary:
With the development of cloud computing, big data applications continue to expand in scope. The value of big data is receiving increasing attention, and the demand for real-time, effective data processing is rising with it. Applying cluster analysis to big data makes it easier to extract information, knowledge and decision support from the data. Traditional clustering algorithms, however, are not fast enough on massive or high-dimensional data, and a single clustering algorithm does not perform satisfactorily on new data. On large-scale data they cannot run efficiently within the limits of memory size, so traditional clustering algorithms struggle to meet the needs of current practical applications.
Clustering ensemble has been proven to greatly improve the performance of traditional clustering algorithms. MapReduce, a parallel computing model, enables many users to analyze large data sets on clusters. However, the MapReduce model is not a panacea. To improve the accuracy of clustering results, especially when dealing with very large numbers of records, this thesis proposes a Distributed Clustering Ensemble algorithm (DisCE) based on Resilient Distributed Datasets (RDDs). DisCE takes full advantage of both the RDD model and the clustering ensemble approach, and effectively improves the clustering performance and processing capability of big data applications. In this algorithm, an RDD-based Distributed Adjacency List is first designed to store and retrieve the data of the Co-association Matrix in a distributed environment. Next, a distributed consensus function combines the several clustering results obtained from the big data and presents the combined result in the form of a Distributed Adjacency List. Lastly, the Distributed Adjacency List is divided into the final clustering result by an optimized Affinity Propagation algorithm.
Spark is a new generation of distributed big data processing framework that follows Hadoop. This thesis designs and implements a clustering ensemble system based on Spark, which provides mass data storage, processing and interoperability, and offers high reliability for big data applications. The system architecture follows hierarchical and component-oriented design ideas. From bottom to top, the layers are: the distributed computing layer, the basic platform layer, the algorithm analysis layer, the cloud service layer and the user client layer. The implementation makes full use of currently popular software frameworks, which shortens the development cycle and improves the quality of the system.
Finally, in system testing, this thesis carries out an accuracy test of the system's core algorithms and a system performance test. Analysis of the test results demonstrates the effectiveness and practicality of the work.
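To make the idea of storing a Co-association Matrix as an adjacency list more concrete, the following is a minimal single-machine sketch in plain Python. It is not the thesis's DisCE implementation and does not use Spark; the function name `co_association_adjacency`, the input layout (one label list per base clustering) and the normalisation by the number of clusterings are all assumptions made for illustration. In an RDD setting, each (point, neighbours) entry would presumably become one distributed record.

```python
from collections import defaultdict
from itertools import combinations


def co_association_adjacency(partitions):
    """Build a sparse co-association structure as an adjacency list.

    partitions: list of base clusterings; each is a list of cluster labels,
    one per data point, with the same point order in every clustering.
    Returns {i: {j: fraction of clusterings that place i and j together}}.
    Pairs that never share a cluster are simply absent (sparse storage).
    """
    n_runs = len(partitions)
    counts = defaultdict(lambda: defaultdict(int))
    for labels in partitions:
        # Group point indices by the label they received in this clustering.
        members = defaultdict(list)
        for point, label in enumerate(labels):
            members[label].append(point)
        # Every pair inside the same cluster co-occurs once in this run.
        for group in members.values():
            for i, j in combinations(group, 2):
                counts[i][j] += 1
                counts[j][i] += 1
    # Normalise counts to co-association weights in [0, 1].
    return {
        i: {j: c / n_runs for j, c in neighbours.items()}
        for i, neighbours in counts.items()
    }


# Three hypothetical base clusterings of four points.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
adj = co_association_adjacency(partitions)
```

Here points 0 and 1 share a cluster in all three runs, so `adj[0][1]` is 1.0, while points 0 and 3 never do, so 3 does not appear among the neighbours of 0. A consensus step such as Affinity Propagation would then take these weights as pairwise similarities; that step is not shown here.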
Keywords/Search Tags: Clustering Analysis, Clustering Ensemble, Distributed Computation, Spark, Resilient Distributed Datasets (RDDs)