
Research And Design Of Parallel K-prototypes Clustering Algorithm Based On Hadoop

Posted on: 2015-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: X Li
Full Text: PDF
GTID: 2298330452953323
Subject: Computer Science and Technology
Abstract/Summary:
With the exponential growth of data in the Internet era, extracting valuable information from massive data sets has become an increasingly serious problem in the field of data mining. Clustering, which groups the data so that each group can then be processed in a different way, is one of the most widely used data mining techniques. Hadoop provides a distributed storage architecture and a programming model that make it well suited to processing massive data. This work studies in depth, on the basis of Hadoop, the clustering of multi-dimensional data in the context of big data.

First, the principles of Hadoop and its underlying storage platform are discussed. Based on the characteristics of Hadoop and of the algorithms, a software architecture for data pre-processing is proposed in order to guarantee the accuracy of the data. Second, building on that architecture, the PK-prototypes algorithm, based on the distributed Map/Reduce programming model, is put forward to process large multi-dimensional mixed data. Finally, following a detailed analysis of the algorithm, an improved algorithm called PK-prototypesBAW is introduced. Comparison with other algorithms shows that the improved algorithm is highly effective.

To optimize efficiency, specific and feasible optimization methods are given for different aspects of the algorithms, addressing the shortcomings of both the algorithms and the Hadoop platform. The methods are as follows: the initial cluster-center file is determined by a sample selection algorithm; small files, which Hadoop does not handle well, are given dedicated processing; part of the load of the Reduce stage is taken over in the Map stage; an effective compression algorithm is adopted to reduce the amount of data transferred; and the transmission protocol between nodes is improved to make it better suited to the Hadoop distributed environment.

With the rapid development of e-commerce, the transaction data it produces has reached a massive scale. Based on the results of cluster analysis, customers can be classified by their shopping habits, and a different promotion method can be adopted for each class of customers in order to maximize benefits. The PK-prototypes algorithm therefore has broad application prospects.
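The abstract does not give the details of PK-prototypes, so the following is only a rough sketch of how one K-prototypes iteration might be arranged in a Map/Reduce style, including the idea of relieving the Reduce stage by pre-aggregating in the Map stage. It assumes the standard K-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches), which may differ from the thesis. All function names are hypothetical, and plain Python lists stand in for Hadoop input splits; no Hadoop API is used.

```python
from collections import Counter, defaultdict

def dissimilarity(record, prototype, gamma):
    """Squared Euclidean distance on numeric attributes plus
    gamma * number of mismatching categorical attributes (assumed form)."""
    (num, cat), (p_num, p_cat) = record, prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num, p_num))
    mismatches = sum(a != b for a, b in zip(cat, p_cat))
    return numeric + gamma * mismatches

def map_partition(records, prototypes, gamma):
    """Map stage with local pre-aggregation: assign each record to its
    nearest prototype and keep only per-cluster counts, numeric sums and
    category frequencies, so less data reaches the Reduce stage."""
    partial = {}
    for num, cat in records:
        k = min(range(len(prototypes)),
                key=lambda i: dissimilarity((num, cat), prototypes[i], gamma))
        if k not in partial:
            partial[k] = [0, [0.0] * len(num), [Counter() for _ in cat]]
        stats = partial[k]
        stats[0] += 1
        for j, value in enumerate(num):
            stats[1][j] += value
        for j, value in enumerate(cat):
            stats[2][j][value] += 1
    return partial

def reduce_cluster(partials):
    """Reduce stage: merge the partial statistics of one cluster into a new
    prototype (mean of numeric attributes, mode of categorical ones)."""
    count = sum(p[0] for p in partials)
    n_num, n_cat = len(partials[0][1]), len(partials[0][2])
    means = tuple(sum(p[1][j] for p in partials) / count for j in range(n_num))
    modes = tuple(
        sum((p[2][j] for p in partials), Counter()).most_common(1)[0][0]
        for j in range(n_cat)
    )
    return means, modes

def iterate(partitions, prototypes, gamma=1.0):
    """One full iteration: map every partition, group by cluster, reduce."""
    grouped = defaultdict(list)
    for part in partitions:
        for k, stats in map_partition(part, prototypes, gamma).items():
            grouped[k].append(stats)
    updated = list(prototypes)  # clusters that received no records keep their prototype
    for k, partials in grouped.items():
        updated[k] = reduce_cluster(partials)
    return updated
```

Each record here is a pair `(numeric_tuple, categorical_tuple)`, and a prototype has the same shape. In a real deployment `iterate` would be repeated until the prototypes stop changing, with `map_partition` playing the role of a mapper plus combiner.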
Keywords/Search Tags:Cloud computing, Hadoop, Data mining, K-as clustering algorithm