
Research And Application Of Clustering Algorithm On The High Dimensional Datasets

Posted on: 2018-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Sun
Full Text: PDF
GTID: 2348330512959261
Subject: Software engineering
Abstract/Summary:
In recent years, with the rapid development of information technology, cloud computing, and social networks, the data accumulated in every domain has become high-dimensional and is growing rapidly. These massive datasets potentially contain a large amount of useful information, so how to analyze high-dimensional data effectively and extract that information has become both a research hotspot and a difficult problem. As an important data mining method, clustering analysis has been widely used in education, finance, scientific research, the Internet, and other fields. Although existing clustering algorithms achieve high clustering quality on low-dimensional data, their clustering validity may degrade on high-dimensional data. Hence, finding an approach to clustering large-scale, high-dimensional data has become a key and difficult problem. Based on an analysis of the characteristics of high-dimensional data, this thesis studies attribute reduction and membership functions in depth. The main work of the thesis is summarized as follows:

(1) The rough K-means algorithm has been widely used in data clustering analysis. To overcome the low clustering quality caused by manually setting the number of clusters and by the inaccurate description of data objects, the thesis proposes a self-adaptive rough K-means algorithm based on weighted distance, built on rough set theory. The improved algorithm applies attribute reduction to the high-dimensional data, adjusts the membership function, sets weights, and then determines the number of clusters adaptively, so it can deal with high-dimensional data effectively. Extensive experiments on UCI datasets demonstrate that the improved algorithm not only maintains efficiency but also achieves noticeably higher accuracy.

(2) When existing clustering algorithms deal with high-dimensional data, they are extremely unstable because of the curse of dimensionality, and some algorithms cannot describe high-dimensional data effectively. To address this, the thesis proposes an improved algorithm for high-dimensional data clustering analysis. The algorithm first applies attribute reduction, then computes a weighted similarity measure between data objects to obtain a similarity matrix, and finally performs Condensed Cluster Analysis on the processed data according to the similarity matrix and a threshold value. Extensive experiments on artificial and UCI datasets demonstrate the effectiveness of the proposed algorithm.

(3) To further show the practical value of the proposed algorithms, the thesis applies the algorithm to food safety testing, a real-world data mining task involving large-scale, high-dimensional datasets. Experimental results demonstrate that the proposed algorithm achieves higher clustering performance. Consequently, clustering analysis of large-scale, high-dimensional datasets can be implemented effectively.
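
The three-step pipeline described in item (2) (attribute reduction, weighted similarity matrix, threshold-based cluster analysis) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, whose details the abstract does not give. The sketch assumes variance-based column selection as a stand-in for rough-set attribute reduction, a similarity of the form 1 / (1 + weighted Euclidean distance), and single-linkage merging of all objects whose similarity reaches the threshold. All function names, weights, and the threshold are hypothetical.

```python
import math


def reduce_attributes(rows, keep):
    """Assumed stand-in for attribute reduction: keep the `keep`
    columns with the highest variance and drop the rest."""
    n = len(rows)

    def variance(j):
        mean = sum(r[j] for r in rows) / n
        return sum((r[j] - mean) ** 2 for r in rows) / n

    ranked = sorted(range(len(rows[0])), key=variance, reverse=True)
    kept = sorted(ranked[:keep])
    return [[r[j] for j in kept] for r in rows], kept


def weighted_similarity(a, b, weights):
    """Assumed similarity in (0, 1]: 1 / (1 + weighted Euclidean distance)."""
    dist = math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))
    return 1.0 / (1.0 + dist)


def similarity_matrix(rows, weights):
    """Pairwise similarity matrix over all data objects."""
    n = len(rows)
    return [[weighted_similarity(rows[i], rows[j], weights) for j in range(n)]
            for i in range(n)]


def threshold_clusters(sim, threshold):
    """Single-linkage merging (an assumption): two objects end up in the
    same cluster if a chain of pairs with similarity >= threshold links them."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(i), len(root_to_label))
            for i in range(n)]


if __name__ == "__main__":
    # Toy data: two groups; the third column is nearly constant.
    data = [
        [1.0, 1.1, 5.0],
        [1.2, 0.9, 5.0],
        [0.9, 1.0, 5.1],
        [8.0, 8.2, 5.0],
        [8.1, 7.9, 5.1],
    ]
    reduced, kept = reduce_attributes(data, keep=2)
    sim = similarity_matrix(reduced, weights=[0.5, 0.5])
    print(kept, threshold_clusters(sim, threshold=0.5))
```

In this sketch the number of clusters is not set in advance. It follows from the threshold: a higher threshold yields more, tighter clusters, and a lower one merges more objects together.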
Keywords/Search Tags: clustering analysis, high dimensional datasets, attribute reduction, rough set theory, self-adaptive method, similarity measure