
Application And Research On Clustering Algorithm In Large Scale High Dimensional Datasets

Posted on: 2016-10-19
Degree: Master
Type: Thesis
Country: China
Candidate: J Deng
Full Text: PDF
GTID: 2308330464964984
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of cloud computing, the Internet of Things, and social networks, the volume of data accumulated in every domain has grown rapidly. These massive datasets potentially contain a large amount of useful information, so how to collect and analyze them effectively to extract that information has become both a research hotspot and a challenge. As an important unsupervised learning method in data mining, clustering analysis has been widely used in education, scientific research, the Internet, and other fields. Although existing clustering algorithms achieve high clustering quality on small-scale, low-dimensional data, their clustering validity can degrade on large-scale, high-dimensional data. Finding an approach to clustering large-scale, high-dimensional data has therefore become a key problem. This thesis takes data scale reduction as its starting point and studies sampling techniques in depth. The main work of the thesis is summarized as follows:

(1) Sampling has been widely used in clustering analysis of large-scale data. To overcome the low sample quality that results from setting the sample size manually, the thesis proposes a sampling clustering algorithm with a self-adaptive sample size, built on an existing statistical optimal sample size algorithm. The improved algorithm adds a step that removes redundant features from high-dimensional data, so it can handle large-scale, high-dimensional data effectively. Extensive experimental results on UCI datasets demonstrate that the sample sets obtained by the improved algorithm, at the sample size it determines, are of higher quality.

(2) Although the existing variable grid division density biased sampling clustering algorithm handles large-scale data efficiently, it becomes inefficient on high-dimensional data because it has to process every dimension. To address this, the thesis proposes an efficient density biased sampling clustering algorithm. First, the importance of different features of high-dimensional data in the cluster space is studied, and an efficient feature selection method for high-dimensional data is designed. Second, this method is combined with the variable grid density biased sampling clustering algorithm, so that the improved algorithm can handle large-scale, high-dimensional data effectively. Extensive experimental results on artificial and UCI datasets demonstrate the effectiveness of the proposed algorithm.

(3) To further demonstrate the practical applicability of the proposed efficient density biased sampling algorithm, the algorithm is added to Weka, an open-source data mining platform, and compared with the other sampling algorithms in Weka on a real large-scale, high-dimensional dataset. Experimental results demonstrate that the proposed algorithm obtains higher sample quality and achieves higher clustering performance on the sample set. As a consequence, clustering analysis of large-scale, high-dimensional datasets can be carried out effectively.
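The idea behind item (2), selecting a few informative features and then sampling with a density bias over a grid, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the abstract does not give its feature selection method or its variable grid division, so the sketch assumes a simple variance ranking as a stand-in for feature selection and a fixed equal-width grid in place of the variable grid. The function name `density_biased_sample` and the parameters `bins`, `exponent`, and `n_features` are hypothetical. It uses one common formulation of density biased sampling, in which a point in a cell holding n_i points is kept with probability alpha / n_i^e, with alpha chosen so that the expected sample size equals the target.

```python
import numpy as np


def density_biased_sample(X, target_size, bins=4, exponent=0.5,
                          n_features=None, seed=0):
    """Return (indices, weights) of a density-biased sample of the rows of X.

    Assumptions (not taken from the thesis): variance ranking stands in for
    feature selection, and the grid is fixed and equal-width.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Stand-in feature selection: keep the highest-variance features so the
    # grid is built over a few dimensions instead of all of them.
    if n_features is None or n_features > d:
        n_features = d
    keep = np.argsort(X.var(axis=0))[::-1][:n_features]
    Z = X[:, keep]

    # Assign every point to a cell of an equal-width grid.
    lo = Z.min(axis=0)
    span = Z.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on constant features
    cell = np.minimum(((Z - lo) / span * bins).astype(int), bins - 1)

    # Count the points in each occupied cell.
    _, cell_id, counts = np.unique(cell, axis=0, return_inverse=True,
                                   return_counts=True)
    cell_id = cell_id.ravel()
    n_i = counts[cell_id]  # size of the cell each point falls in

    # Keep a point with probability alpha / n_i**exponent, where alpha makes
    # the expected sample size equal target_size (before clipping at 1).
    alpha = target_size / np.sum(counts ** (1.0 - exponent))
    prob = np.minimum(1.0, alpha / n_i ** exponent)

    idx = np.flatnonzero(rng.random(n) < prob)
    weights = 1.0 / prob[idx]  # inverse-probability weights
    return idx, weights
```

With `exponent=0` every point has the same inclusion probability, which reduces to uniform sampling; larger values shift the sample toward sparse cells, so small clusters are less likely to be missed. The returned weights could then be passed to a clustering routine that accepts per-point weights, so that dense regions still count in proportion to their original size.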
Keywords/Search Tags: clustering analysis, sampling, variable grid division, feature selection, large scale high dimensional dataset