Research And Application Of Clustering Mining Algorithm Oriented Big Data Based On MapReduce

Posted on: 2019-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Hu
Full Text: PDF
GTID: 2428330596964799
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the volume of information is growing explosively. Faced with massive data sets, how to use existing data to find useful information and maximize its potential value has become a common concern in academia and industry. Traditional clustering algorithms struggle to meet the needs of massive data sets, so new data mining algorithm frameworks are urgently needed. The development of parallelization frameworks provides a new direction for big data clustering algorithms. MapReduce is a distributed programming model for parallel operations on large data sets, with a widely used open-source implementation in Apache Hadoop. It supports data storage and parallel computation without requiring expensive hardware. The integration of the MapReduce framework with clustering algorithms has become a research hotspot in the field of data mining.

Firstly, this paper studies and improves the Canopy algorithm, determines the optimal number of clusters from the gradient value, proposes a method to dynamically change the radius of the region, and, in combination with the distributed framework, proposes an improved two-phase CTK clustering algorithm based on MapReduce. Secondly, to address the problem of uneven data distribution in the parallelization framework, a clustering method oriented to spatial data distribution is proposed. Finally, the improved algorithm is applied to the analysis of web logs, and a website log analysis model for the big data setting is established.

The main work of this paper is as follows:

1. An improved distributed clustering algorithm based on MapReduce is introduced. The clustering process is divided into two stages. In the first stage, the Canopy algorithm is improved so that a suitable K for the clustering algorithm is found from the change in gradient value, which reduces the number of iterations and avoids the uncertainty caused by the choice of initial center points. The center points and the number of clusters are then used as the input parameters of the second stage, in which the radius of the region is changed dynamically to keep the algorithm from falling into a local optimum. Finally, the parallel strategy of the algorithm is designed according to the MapReduce framework.

2. This paper analyzes and discusses the distribution of cluster data in the big data setting and puts forward a clustering optimization algorithm for spatial data distribution. Carrying out secondary clustering on the cluster centers reduces the data transmitted from the Map side to the Reduce side. An information entropy weighting strategy is proposed to improve the accuracy of the similarity calculation. In the similarity comparison, a certainty judgment is added to reduce the number of comparisons, and the Combine stage is optimized to reduce the I/O consumption of the cluster. The final experimental results demonstrate the accuracy and stability of the algorithm.

3. This paper applies the clustering algorithm to website log analysis, expounds the common principles of log analysis, and analyzes different schemes of session identification. Users' characteristic parameters are extracted from the session log to calculate similarity. The improved algorithm is used to cluster users by similarity, and the clustering results are analyzed. The experiment demonstrates the rationality of the model and the accuracy of the algorithm, and also provides a reasonable basis for the operational decisions of the website.

Finally, the paper is summarized and directions for further research are proposed.
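For orientation, the first-stage idea of using Canopy to obtain a cluster count and initial centers can be sketched as follows. This is a minimal single-machine sketch of the standard Canopy pre-clustering procedure, not the improved gradient-based variant or the MapReduce parallelization proposed in the thesis. The function name `canopy`, the thresholds `t1` and `t2`, and the sample points are assumptions chosen for illustration.

```python
import math


def canopy(points, t1, t2):
    """Standard Canopy pre-clustering (illustrative; not the thesis's variant).

    t1 is the loose threshold and t2 the tight threshold, with t1 > t2.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")

    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next unassigned point as a new canopy center.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose threshold: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: the point may still seed
                # or join another canopy.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Hypothetical sample data: two well-separated groups.
sample = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
result = canopy(sample, t1=3.0, t2=1.5)
k = len(result)                      # candidate number of clusters
centers = [c for c, _ in result]     # candidate initial centers
```

In a two-stage scheme of the kind the abstract describes, `k` and `centers` would be passed to the second-stage clustering as its initial parameters. How the thesis selects the thresholds and adjusts the radius is not specified in the abstract, so fixed values are used here.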
Keywords/Search Tags:big data, clustering algorithm, MapReduce framework, Canopy algorithm, web log analysis