
Application And Research On Clustering Algorithm In Large Scale Datasets

Posted on: 2015-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Sheng
Full Text: PDF
GTID: 2298330431990230
Subject: Computer application technology
Abstract/Summary:
With the rapid development and growing adoption of information technology, the volume of data accumulated by researchers in different fields has increased greatly. In recent years in particular, with the rising popularity of concepts such as information explosion and big data, how to extract useful information from large scale datasets has become an important research focus. Data mining technology emerged in response and is used to extract potentially useful information from large scale datasets.

As an important branch of data mining, clustering analysis has been widely used in data analysis, image processing, pattern recognition, and other fields. However, with the rapid growth of data size, classical clustering algorithms may be limited in effectiveness or efficiency. Research on how to apply classical algorithms to large scale datasets has therefore become particularly important. To address these limitations, this thesis conducts further research on data sampling, consisting of the following.

First, simple random sampling usually loses small clusters when dealing with unevenly distributed datasets. Building on the grid-based density biased sampling algorithm, a new variable grid division algorithm is proposed, and with it a new density biased sampling algorithm based on variable grid division. Experimental results show that the proposed variable grid division algorithm not only constructs a grid that matches the distribution of the original dataset but also sets the values of the related parameters automatically. Furthermore, the proposed density biased sampling algorithm based on variable grid division achieves higher quality than simple random sampling and takes less sampling time than the grid-based density biased sampling algorithm.

Second, further research is conducted on improving the practicality of the density biased sampling algorithm based on variable grid division. By adding the proposed algorithm to WEKA, a popular data mining software package, clustering analysis of a large scale dataset composed of location information from a website is carried out successfully and effectively. The experimental results show that the proposed algorithm retains its advantages on practical problems. Compared with other sampling algorithms in WEKA and with the Scalable-EM algorithm proposed by Microsoft Research, the proposed algorithm improves the representativeness of the sample dataset and reduces the total time for clustering. As a result, clustering analysis of large scale datasets can be carried out effectively and accurately.
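To make the sampling idea concrete, the following is a minimal Python sketch of the fixed-grid density biased sampling baseline that the abstract builds on. It is not the thesis's variable grid division algorithm, whose details the abstract does not give. The function name, the `bins`, `target_size`, and `e` parameters, and the equal-width grid are all assumptions made for illustration. Each point in a cell holding n points is kept with probability alpha / n^e, so sparse cells are sampled more heavily than dense ones and small clusters are less likely to be lost.

```python
import random
from collections import defaultdict


def grid_density_biased_sample(points, bins, target_size, e=0.5, seed=0):
    """Fixed-grid density biased sampling (illustrative sketch only).

    points      -- list of equal-length tuples of floats
    bins        -- equal-width intervals per dimension (hypothetical parameter)
    target_size -- expected sample size, before probabilities are capped at 1
    e           -- bias exponent: 0 is uniform sampling, 1 gives every
                   occupied cell the same expected number of sampled points
    Returns a list of (point, weight) pairs, where weight = 1 / probability.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        # Map a point to the index of its equal-width grid cell.
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            i = 0 if span == 0 else int((p[d] - lows[d]) / span * bins)
            idx.append(min(i, bins - 1))  # the maximum value falls in the last bin
        return tuple(idx)

    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    # Normalizing constant: sum over cells of n * (alpha / n**e) = target_size.
    alpha = target_size / sum(len(c) ** (1 - e) for c in cells.values())

    sample = []
    for members in cells.values():
        prob = min(1.0, alpha / len(members) ** e)  # cap at 1 for tiny cells
        for p in members:
            if rng.random() < prob:
                sample.append((p, 1.0 / prob))
    return sample
```

With `e=0` every point has the same inclusion probability, which reduces to simple random sampling. Raising `e` toward 1 shifts the sample toward sparse cells, and the returned weights let a downstream clustering step compensate for that bias.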
Keywords/Search Tags: variable grid division, density biased sampling, large scale datasets, clustering analysis