
Research On Machine Learning Clustering Algorithms In The Hadoop Development Environment

Posted on: 2019-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: C Sun
GTID: 2428330572950309
Subject: Control theory and control engineering
Abstract/Summary:
As a modeling technique, cluster analysis plays an important role in data mining and machine learning. The purpose of clustering is to discover regularities and organizational structure in data that carry no category information. Specifically, a given data set is divided into multiple classes, or clusters, so that objects in the same cluster are highly similar and objects in different clusters are dissimilar. Clustering has been widely used in social networking, biology, medicine, engineering, transportation, and other fields. However, with the development of information technology, the amount of data generated in these fields is growing exponentially, and traditional clustering algorithms cannot cope with the current scale of data. It is therefore of practical significance to develop efficient and scalable parallel clustering methods for analyzing big data. The emergence of distributed platforms such as Hadoop and Spark effectively addresses the reliable storage and processing of big data, and also opens a new research direction for scaling clustering algorithms. This thesis focuses on improving the efficiency and scalability of clustering algorithms. By deploying clustering algorithms on a distributed platform architecture, it addresses the difficulty that traditional clustering algorithms and software tools have in clustering big data. The main work of the thesis is as follows:

(1) Clustering algorithms are divided into different types according to the features of the data and the required clustering characteristics. After introducing and comparing the various clustering algorithms, this thesis chooses the partition-based K-means algorithm and the more accurate Fuzzy C-means (FCM) algorithm, and introduces the basic principle and execution flow of the two algorithms in detail.

(2) Because K-means and FCM run slowly on massive data, parallel implementations of both algorithms on the Hadoop and Spark distributed platforms are proposed. The parallel algorithm based on the Hadoop MapReduce framework adds a Combine function to the designed Map and Reduce functions. By merging the intermediate data locally on each Map node, it reduces the cost of I/O communication between nodes and improves computational efficiency. The parallel algorithm based on the Spark framework keeps intermediate results in memory using the resilient distributed dataset (RDD) data structure, so the algorithm can iterate over the RDD data efficiently and avoid disk I/O overhead. In addition, the FCM algorithm has the disadvantage of high complexity. The parallel FCM algorithm based on the Spark framework proposed in this thesis avoids storing the partition matrix directly, which reduces the space requirement of the algorithm and, to some extent, its running time.

(3) To compare the performance of the clustering algorithms on the two big data processing platforms, Hadoop and Spark, the parallel algorithms based on the MapReduce and Spark frameworks are evaluated with performance indexes such as runtime, cluster quality, speedup, and scaleup. The experimental data sets consist of randomly generated artificial data and real data. The comparison shows that the K-means and FCM algorithms on the Spark platform run significantly faster than their MapReduce counterparts without compromising the quality of the clustering results, and that, as the number of nodes increases, both their speedup and their scaleup are larger than those of the MapReduce-based parallel algorithms. This shows that the Spark platform has advantages over the Hadoop platform in processing iterative clustering algorithms. At the same time, the efficiency, clustering quality, and scalability of the algorithms are improved. In summary, clustering algorithms based on the Spark platform can provide more efficient cluster analysis for big data.
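
The division of work among the Map, Combine, and Reduce functions described in (2) can be illustrated for a single K-means iteration. The following is a minimal single-process sketch in plain Python, not the thesis's Hadoop implementation. The function names (`nearest`, `merge`, `map_phase`, `combine_phase`, `reduce_phase`, `kmeans_iteration`) and the modeling of input splits as Python lists are assumptions made for illustration only.

```python
def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def merge(records):
    """Sum the (vector, count) values of all records that share a key."""
    merged = {}
    for key, (vec, n) in records:
        if key in merged:
            acc, count = merged[key]
            merged[key] = ([a + b for a, b in zip(acc, vec)], count + n)
        else:
            merged[key] = (list(vec), n)
    return merged


def map_phase(split, centers):
    """Map: emit (cluster index, (point, 1)) for each point of one split."""
    return [(nearest(p, centers), (list(p), 1)) for p in split]


def combine_phase(mapped):
    """Combine: pre-aggregate one split locally, so at most one
    (partial sum, count) record per cluster leaves the map node."""
    return list(merge(mapped).items())


def reduce_phase(combined, centers):
    """Reduce: merge the partial sums of all splits into new centers.
    A cluster that received no points keeps its previous center."""
    totals = merge(combined)
    new_centers = []
    for k, old in enumerate(centers):
        if k in totals:
            acc, n = totals[k]
            new_centers.append(tuple(x / n for x in acc))
        else:
            new_centers.append(tuple(old))
    return new_centers


def kmeans_iteration(splits, centers):
    """One K-means iteration; each list in splits stands for one map node."""
    combined = []
    for split in splits:
        combined.extend(combine_phase(map_phase(split, centers)))
    return reduce_phase(combined, centers)


# Hypothetical toy data: two splits, two initial centers.
splits = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(10.0, 12.0), (2.0, 0.0), (2.0, 2.0)],
]
centers = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_iteration(splits, centers))  # [(1.0, 1.0), (10.0, 11.0)]
```

Because the combiner emits at most one (sum, count) pair per cluster for each split, the volume of data passed to the reducer depends on the number of clusters and splits rather than on the number of points, which is the communication saving the abstract attributes to the Combine function.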
Keywords/Search Tags:Big Data, Clustering Analysis, K-Means Algorithm, Fuzzy C-Means Algorithm, Hadoop, Spark