Research On Parallelization Of Clustering Algorithm Based On MapReduce

Posted on:2017-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:D C Liu

Full Text:PDF

GTID:2348330488989359

Subject:Computer application technology

Abstract/Summary:

With the advance of the information society and the rapid emergence of massive data, a number of parallel data mining algorithms have been proposed. Cluster analysis is a powerful analytical tool for data mining; its salient feature is that it requires no prior knowledge or information, which makes it a form of unsupervised learning. K-means is a typical partition-based clustering algorithm that is simple and easy to implement, but it also has disadvantages: it is sensitive to the initial cluster centers and easily falls into a local optimum. Faced with massive, large-scale, high-dimensional data, the traditional computing model struggles to provide the necessary processing capacity, and the Hadoop cloud platform offers a new approach to data processing.
As the construction of intelligent power systems deepens and advances, power system data is becoming massive and high-dimensional. With global energy problems growing increasingly serious, smart grid construction is being pursued further both at home and abroad, and the electricity data generated along with it is growing exponentially, making it a big data problem of public concern.
Given that big data arises in more and more situations in the grid, it is necessary to study how to process massive power data reliably and efficiently by drawing on the Hadoop cloud platform, with its distributed redundant storage and parallel computing.
This thesis studies the parallelization of the K-means clustering algorithm using the MapReduce parallel framework, and models the detection and identification of bad data in the power system. The main research work is as follows:
First, because traditional clustering algorithms cannot handle massive data, and based on an analysis of the shortcomings of the existing K-means algorithm, an improved MapReduce-based K-means clustering algorithm, MR-IKmeans (MapReduce-based Improved K-means), is proposed. It introduces random sampling and the maximum-minimum distance method, among other techniques, and combines them with the MapReduce parallel computing framework. The algorithm first draws multiple random samples from the data set, then uses a two-stage maximum-minimum distance method to produce the best initial cluster centers, and finally runs K-means clustering. Experiments on well-known UCI data sets on a Hadoop cluster show that the algorithm is superior to the traditional K-means algorithm in convergence speed and clustering accuracy, and performs excellently when processing huge amounts of data in parallel.
Second, bad data in the power system reduces the accuracy of power system state estimation results. Traditional clustering algorithms are limited to a single computing resource when handling massive high-dimensional data, and the MapReduce framework, popular in recent years, cannot deal effectively with frequent iteration and related issues. To address this, a new method of identifying bad data based on a Spark parallel K-means algorithm is proposed. Taking the power load data of one node as the object of study, the method extracts daily load characteristic curves and applies the Spark-based parallel K-means clustering algorithm to detect and identify bad data in grid state estimation.
Experimental results on real power load data provided by EUNITE show that this method effectively improves the accuracy of the state estimation results. Compared with a method based on the MapReduce framework, it achieves a better speedup ratio and scalability and is better able to deal with the massive data of the power system.
A cloud computing cluster was built in the laboratory, and experimental tests and numerical examples were carried out on it. The results show that the proposed algorithm is fast and efficient, and that the new bad data identification method based on Spark and cluster analysis meets the power system's demand for processing massive high-dimensional data well, which is of considerable value for ensuring the accuracy of power system state estimation.
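The pipeline outlined in the abstract (random sampling, maximum-minimum distance selection of initial centers, then K-means in a map/reduce style) can be illustrated with a minimal single-machine sketch. This is not the thesis's MR-IKmeans implementation: the abstract does not spell out the two-stage procedure, so the code below uses a plain single-stage maximum-minimum distance rule on one random sample, and it simulates the map and reduce steps in ordinary Python rather than on Hadoop. All function names and parameters (`max_min_init`, `map_assign`, `reduce_mean`, `sample_size`, `seed`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def max_min_init(points, k, sample_size, seed=0):
    """Pick k initial centers from a random sample (single-stage max-min rule)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = [sample[0]]
    while len(centers) < k:
        # Next center: the sample point whose nearest chosen center is farthest.
        nxt = max(sample, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, (point, 1)


def reduce_mean(pairs):
    """Reduce step: average all points emitted for one center."""
    total, count = None, 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(v / count for v in total)


def kmeans(points, k, sample_size=100, max_iter=50, seed=0):
    """Run K-means from max-min initial centers until the centers stop moving."""
    centers = max_min_init(points, k, sample_size, seed)
    for _ in range(max_iter):
        groups = defaultdict(list)  # stands in for the shuffle phase
        for p in points:
            idx, value = map_assign(p, centers)
            groups[idx].append(value)
        # An empty cluster keeps its previous center.
        new_centers = [
            reduce_mean(groups[i]) if groups[i] else centers[i] for i in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

On two well-separated groups of points, the max-min rule places the second initial center in the group that does not contain the first, which is the property that reduces sensitivity to a poor random start. In a real Hadoop job the `groups` dictionary would be replaced by the framework's shuffle, and a combiner would typically pre-aggregate partial sums on each node.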

Keywords/Search Tags:

cluster analysis, MapReduce model, bad data detection and identification, Spark, K-means algorithm

Related items

1	Research On Spark Oriented Fuzzy C-means Clustering Algorithm
2	Research And Application Of K-means Algorithm Based On Density And Distance
3	Research On The Implementation Of Bursty Events Detection Based On Spark
4	Parallelizing K-means-based Clustering On Spark
5	Research On Parallel K-means Algorithm Based On Genetic Algorithm
6	Research On Parallel Sampling K-Means Algorithm Based On MapReduce
7	Research And Application Of K-means Clustering Algorithm
8	Research On Parallel Clustering Algorithm For Large-Scale Data Set
9	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
10	Research On Parallel Clustering Algorithm For Streaming Data