Application And Research On Clustering Algorithm In Large Scale Datasets

Posted on:2015-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:K Y Sheng

Full Text:PDF

GTID:2298330431990230

Subject:Computer application technology

Abstract/Summary:

With the rapid development and growing adoption of information technology, the volume of data accumulated by researchers in different fields has increased greatly. In recent years in particular, with the rising popularity of concepts such as information explosion and big data, how to extract useful information from large scale datasets has become an important research focus. Data mining emerged in response, as a technology for extracting potentially useful information from large scale datasets.

As an important branch of data mining, clustering analysis has been widely employed in data analysis, image processing, pattern recognition, and other fields. However, with the rapid growth of data size, classical clustering algorithms may be limited in effectiveness or efficiency. Research on how to apply classical algorithms to large scale datasets has therefore become particularly important. To address these limitations, this thesis investigates data sampling; the work consists of the following.

Firstly, simple random sampling usually loses small clusters when dealing with unevenly distributed datasets. Building on the grid-based density biased sampling algorithm, a new variable grid division algorithm is proposed, and with it a new density biased sampling algorithm based on variable grid division. Experimental results show that the proposed variable grid division algorithm not only constructs a grid that matches the distribution of the original dataset but also sets the values of the related parameters automatically. Furthermore, the proposed density biased sampling algorithm based on variable grid division achieves higher sampling quality than simple random sampling and takes less sampling time than the grid-based density biased sampling algorithm.

Secondly, further research is conducted on how to improve the practicality of the density biased sampling algorithm based on variable grid division. By adding the proposed algorithm to WEKA, a popular data mining software package, clustering analysis of a large scale dataset composed of location information from a website is carried out successfully and effectively. The experimental results show that the proposed algorithm retains its advantages when dealing with practical problems. Compared with other sampling algorithms in WEKA and with the Scalable-EM algorithm proposed by Microsoft Research, the proposed algorithm improves the representativeness of the sample dataset and reduces the total clustering time. As a result, clustering analysis of large scale datasets can be carried out effectively and accurately.
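
To illustrate the baseline that the abstract builds on, the following is a minimal sketch of density biased sampling over a fixed-width grid. It is not the thesis's variable grid division algorithm, whose details the abstract does not give. It assumes one commonly used formulation in which a point in a cell holding n points is kept with probability proportional to n^(-e); the function name, the parameters `cell_size`, `target_size`, and `e`, and the choice of a uniform cell width are all hypothetical.

```python
import random
from collections import defaultdict


def grid_density_biased_sample(points, cell_size, target_size, e=0.5, seed=0):
    """Sketch of grid-based density biased sampling (assumed formulation).

    points      -- list of equal-length tuples of floats
    cell_size   -- width of each grid cell (fixed here; hypothetical parameter)
    target_size -- expected number of sampled points
    e           -- bias exponent: 0 gives uniform sampling, 1 gives roughly
                   the same expected count from every non-empty cell
    """
    rng = random.Random(seed)

    # Assign every point to the grid cell that contains it.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)

    # Normalizer chosen so the expected sample size is target_size
    # (before probabilities are capped at 1).
    norm = sum(len(members) ** (1.0 - e) for members in cells.values())

    sample = []
    for members in cells.values():
        # Points in sparse cells get a higher keep-probability than
        # points in dense cells whenever e > 0.
        prob = min(1.0, target_size / norm * len(members) ** (-e))
        sample.extend(p for p in members if rng.random() < prob)
    return sample
```

With `e` above 0, points in sparsely populated cells are over-sampled relative to simple random sampling, which is why small clusters are less likely to be lost. The variable grid division described in the abstract would take the place of the fixed `cell_size` used in this sketch.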

Keywords/Search Tags:

variable grid division, density biased sampling, large scale datasets, clustering analysis

Related items

1	Application And Research On Clustering Algorithm In Large Scale High Dimensional Datasets
2	Research And Application On Grid Clustering Method Based On Density For Large-scale And Cluster Intersecting Data
3	Research On Adaptive Varied Density Clustering Algorithm Based On DBSCAN
4	Research On Local Outlier Detection Algorithm
5	Performance Optimization Of Interactive Visual Analysis Of Large-scale Graph Data
6	Research On Data Stream Clustering Algorithm Based On Density Grid
7	Research On Spectral Clustering Methods For Large Scale Datasets
8	Multi-Density Clustering And Outlier Recognition Algorithm Based On Grid Adjacency Relation
9	Large-scale structure of the universe: A clustering analysis
10	Research On The Large Scale Clustering And Its Applications On Anomaly Detection