
Research on Clustering Mining of Logistics Historical Data Based on Hadoop

Posted on:2018-04-24
Degree:Master
Type:Thesis
Country:China
Candidate:J Su
Full Text:PDF
GTID:2348330533960325
Subject:Software engineering
Abstract/Summary:
With the development and application of new technologies such as e-commerce, the Internet of Things, and cloud computing, data growth in the logistics industry is no longer slow and linear; it has become massive, complex, real-time, and explosive. Traditional stand-alone storage and serial data mining clearly cannot meet the big data processing needs of today's logistics industry. Hadoop, an open-source platform for distributed computing over large data sets, has become a new trend, and in recent years it has gradually shown its distinct advantages in the field of data mining.

The K-means clustering algorithm is an effective algorithm for big data mining. It is simple and easy to implement, but the selection of the K value and of the initial centroids is largely blind and unpredictable, and the clustering results easily fall into a local optimum. The algorithm also suffers from many redundant distance computations, a slow convergence rate, low clustering accuracy, and a lack of parallelism and scalability in the distance calculation, all of which greatly reduce its operating efficiency. To address these shortcomings of traditional K-means, and combining the advantages of the “Distance Triangle Inequality Principle” and the “Min-Max Principle”, this thesis proposes an improved Canopy-Kmeans algorithm based on a double MapReduce distributed programming model on the Hadoop cloud computing platform. The accuracy of the algorithm is verified on real historical data from She Fa Logistics Company. The specific research is as follows.

Firstly, the thesis describes the Hadoop ecosystem in detail and analyzes its basic components, building modules, and working mechanism in depth. It also analyzes the standard workflow of big data mining. The design and process of the traditional K-means algorithm are studied in depth, with emphasis on the advantages and disadvantages of existing research results.

Secondly, to optimize the selection of the K value, the traditional Canopy algorithm is improved on the Hadoop platform based on the min-max principle. This removes the blindness of manually setting the K value and the region radii T1 and T2 in the traditional Canopy algorithm, providing a reliable theoretical basis for the accuracy of the K-means clustering results.

Thirdly, to eliminate the redundant computation in the iterative process of traditional K-means, a distance selection decision based on the triangle inequality is added before each K-means iteration, which reduces redundant distance calculations. In addition, a convergence test based on a weighted clustering criterion function is added to further improve the efficiency of the algorithm. Together these changes improve the quality and convergence rate of clustering and reduce the misclassification rate of data objects.

Finally, an improved Canopy-Kmeans algorithm based on the double MapReduce programming model is designed and implemented. To validate its feasibility, a Hadoop cluster environment is set up and extensive experiments are carried out on the task of finding the key customer groups of She Fa Logistics Company. The experimental results show that the parallel algorithm is significantly better in clustering accuracy, speedup, scalability, and other respects. It solves the problems of selecting the K value and the Canopy centroids, avoids redundant distance calculation in the iterative process, and improves the convergence speed of the original algorithm. Moreover, the larger the data set and the more nodes in the cluster, the greater the improvement.
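The triangle-inequality decision mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's MapReduce implementation; the function name `assign_with_pruning` and the sample data are hypothetical. The sketch relies only on the standard fact that if d(c_best, c_j) >= 2 * d(x, c_best), then c_j cannot be closer to x than c_best, so the distance d(x, c_j) need not be computed.

```python
import math


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, skipped), where skipped counts avoided computations.
    Illustrative sketch only; names and structure are assumptions.
    """
    k = len(centers)
    # Pairwise center-to-center distances, computed once per iteration.
    cc = [[math.dist(centers[i], centers[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    skipped = 0
    for x in points:
        best = 0
        best_d = math.dist(x, centers[0])
        for j in range(1, k):
            # d(x, c_j) >= d(c_best, c_j) - d(x, c_best), so when
            # d(c_best, c_j) >= 2 * d(x, c_best) center j cannot win.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
    ctrs = [(0.0, 0.0), (10.0, 10.0)]
    print(assign_with_pruning(pts, ctrs))
```

The pruned assignment always returns the same labels as a brute-force nearest-center search; only the number of distance evaluations changes, and the saving grows as points sit closer to their current best center.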
Keywords/Search Tags:Hadoop, Canopy-Kmeans, Data Mining, Double MapReduce