Research Of Clustering Mining Algorithm Oriented Big Data

Posted on: 2016-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Wang
GTID: 2308330473965501
Subject: Computer software and theory
Abstract/Summary:
The great potential value of big data has driven the development of big data mining technology. Big data mining is the process of extracting valuable knowledge from data sources characterized by volume, velocity, and variety. How to mine valuable knowledge from big data accurately and quickly is a hot research topic.

This thesis focuses on big data clustering algorithms, with the objective of improving both their accuracy and their efficiency. Accuracy is improved first, by modifying traditional clustering algorithms; efficiency is then improved by parallelizing the modified algorithm.

The thesis presents a density-based incremental k-means clustering algorithm, named DBIK-means, which builds on the DBSCAN and k-means algorithms. DBIK-means first calculates the density of each data point. It then builds basic clusters by combining each center point whose density exceeds a given threshold with the other points that lie within the density range of that center point. Next, it merges pairs of basic clusters according to the distance between their center points. Finally, it assigns each point that does not belong to any cluster to its nearest cluster. Theoretical analysis and experimental results on the KDD CUP 99 dataset show that the algorithm can find clusters of arbitrary shape and is not sensitive to its parameters or to the input order of the data points. It achieves higher clustering accuracy at a small additional time cost, and its overall performance is better than that of the k-means clustering algorithm.

To improve the efficiency of DBIK-means and reduce its time complexity, the thesis uses a distributed database to simulate a shared memory space and then parallelizes DBIK-means on the Hadoop cloud computing platform. The experimental results show that DBIK-means is suitable for clustering mining of large datasets.

Finally, DBIK-means is applied to the classification of telecom customers. The results show that it can automatically group a large number of telecommunications customers into several clusters more accurately than traditional clustering algorithms, which helps telecom operators develop different marketing strategies for different types of customers.
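
The four DBIK-means steps named in the abstract (density calculation, basic cluster construction, center-distance merging, and assignment of leftover points) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the abstract does not define the density measure, the "density range", or the merge rule, so the sketch assumes that density is the number of points within a radius `eps`, that the density range is that same radius, that two basic clusters merge when their center points are at most `merge_dist` apart, and that the "nearest cluster" is the one with the nearest centroid. The function name and the parameters `eps`, `min_density`, and `merge_dist` are hypothetical.

```python
import math


def dbik_means_sketch(points, eps, min_density, merge_dist):
    """Illustrative sketch only; returns one cluster label per point.

    Assumptions (not from the thesis): density = neighbor count within eps,
    merging uses a fixed center-distance threshold, and leftover points go
    to the cluster with the nearest centroid.
    """
    n = len(points)

    # Step 1: density of each point = number of points within eps (itself included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    density = [len(nb) for nb in neighbors]

    # Step 2: build basic clusters around dense, still-unassigned center points.
    # Visiting points in order of decreasing density is a choice made here.
    label = [None] * n
    centers = []  # index of the center point of each basic cluster
    for i in sorted(range(n), key=lambda idx: -density[idx]):
        if label[i] is None and density[i] >= min_density:
            cid = len(centers)
            centers.append(i)
            for j in neighbors[i]:
                if label[j] is None:
                    label[j] = cid

    if not centers:
        return label  # no point is dense enough; every label stays None

    # Step 3: merge basic clusters whose center points are close (union-find).
    parent = list(range(len(centers)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            if math.dist(points[centers[a]], points[centers[b]]) <= merge_dist:
                parent[find(a)] = find(b)

    roots = sorted({find(c) for c in range(len(centers))})
    remap = {root: k for k, root in enumerate(roots)}
    label = [None if l is None else remap[find(l)] for l in label]

    # Step 4: assign each leftover point to the cluster with the nearest centroid.
    dims = len(points[0])
    centroids = []
    for k in range(len(roots)):
        members = [points[i] for i in range(n) if label[i] == k]
        centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
        )
    for i in range(n):
        if label[i] is None:
            label[i] = min(
                range(len(centroids)),
                key=lambda k: math.dist(points[i], centroids[k]),
            )
    return label


# Two small blobs plus one isolated point that is attached in step 4.
demo_points = [(0, 0), (0, 1), (1, 0), (1, 1),
               (10, 10), (10, 11), (11, 10), (11, 11),
               (5, 5)]
demo_labels = dbik_means_sketch(demo_points, eps=1.5, min_density=3, merge_dist=3)
print(demo_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 0]
```

The pairwise distance computation makes this sketch quadratic in the number of points, so it is only suitable for small inputs; the Hadoop parallelization described in the abstract is not reproduced here.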
Keywords/Search Tags: Big Data, Clustering Mining, K-means, Cloud Computing, Hadoop