
Research On Interactive Information Bottleneck Clustering Algorithm For Large-scale High-dimensional Data

Posted on: 2021-03-18  Degree: Master  Type: Thesis
Country: China  Candidate: R B Wang  Full Text: PDF
GTID: 2428330602976354  Subject: Software engineering
Abstract/Summary:
With the rapid development of information and internet technology, data sets with large scale and high dimensionality are growing exponentially. Because of the "data explosion" and the "curse of dimensionality", traditional clustering algorithms struggle to achieve the expected results on large-scale high-dimensional data. Therefore, to meet the requirements of practical applications and the characteristics of different fields, developing an effective and efficient clustering algorithm for large-scale high-dimensional data has important theoretical significance and practical value. For such data, co-clustering algorithms offer one approach: they cluster the row-wise data points and the column-wise features simultaneously, which reveals the internal relationships between them, integrates the overall information in the data, and uses the correlation between rows and columns to improve clustering performance. Existing co-clustering algorithms attempt to eliminate redundancy or noise by reducing the feature dimensionality, but they still bring harmful original features into the data clustering step, which weakens the final clustering performance. To address these problems, and inspired by co-clustering algorithms, we propose an effective Interactive Information Bottleneck (I²B) clustering algorithm. Compared with existing co-clustering algorithms, I²B clusters the data points in the row direction using dimension-reduced features and uses the clustered data points for column-wise feature clustering, which makes a satisfactory final clustering result more likely. The method has several advantages: (1) it obtains effective discriminant features and eliminates harmful redundant or noisy features, which benefits the clustering of data after each iteration; (2) the clustered data points can be used as supervisory information to guide feature clustering. To our knowledge, this is the first work to address this problem in a co-clustering way. Finally, a new twin “draw-and-merge” method is designed and optimized. The time complexity of the optimized algorithm is linear in both the scale and the dimensionality of the data, so it can process large-scale high-dimensional data efficiently. Experimental results show that the I²B algorithm outperforms the original IB algorithms and other traditional clustering algorithms. Compared with state-of-the-art clustering algorithms for large-scale high-dimensional data, I²B also achieves better stability and higher clustering accuracy.
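The alternating idea described in the abstract (cluster rows on dimension-reduced features, then let the row clusters guide feature clustering) can be illustrated with a rough sketch. This is not the thesis's I²B algorithm and does not implement the "draw-and-merge" optimization: it swaps in a simple k-means-style reassignment under KL divergence, and the names `alternating_cocluster`, `cluster_profiles`, `kl_assign`, and `normalize_rows` are hypothetical. It assumes a nonnegative data matrix `X` (for example, word counts).

```python
import numpy as np

def normalize_rows(m, eps=1e-12):
    """Turn each nonnegative row into a probability distribution."""
    return m / (m.sum(axis=1, keepdims=True) + eps)

def kl_assign(profiles, centroids, eps=1e-12):
    """Assign each profile to the centroid with the smallest KL divergence."""
    p = profiles[:, None, :] + eps              # shape (m, 1, k)
    q = centroids[None, :, :] + eps             # shape (1, c, k)
    kl = np.sum(p * np.log(p / q), axis=2)      # shape (m, c)
    return np.argmin(kl, axis=1)

def cluster_profiles(counts, labels, n_clusters, n_iter=10):
    """Hard clustering of the rows of `counts`, starting from `labels`."""
    profiles = normalize_rows(counts)
    k = counts.shape[1]
    for _ in range(n_iter):
        # Empty clusters fall back to a uniform centroid.
        centroids = np.full((n_clusters, k), 1.0 / k)
        for c in range(n_clusters):
            members = counts[labels == c]
            if len(members):
                centroids[c] = normalize_rows(members.sum(axis=0, keepdims=True))[0]
        new_labels = kl_assign(profiles, centroids)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def alternating_cocluster(X, n_row_clusters, n_col_clusters, n_rounds=5, seed=0):
    """Alternate row clustering on reduced features with guided feature clustering."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    row_labels = rng.integers(n_row_clusters, size=n)
    col_labels = rng.integers(n_col_clusters, size=d)
    for _ in range(n_rounds):
        # Reduce dimensionality: sum the columns inside each feature cluster.
        reduced = X @ np.eye(n_col_clusters)[col_labels]    # shape (n, n_col_clusters)
        row_labels = cluster_profiles(reduced, row_labels, n_row_clusters)
        # Describe each feature by its mass over the current row clusters,
        # so the row clusters act as guidance for feature clustering.
        guided = X.T @ np.eye(n_row_clusters)[row_labels]   # shape (d, n_row_clusters)
        col_labels = cluster_profiles(guided, col_labels, n_col_clusters)
    return row_labels, col_labels
```

For example, `rows, cols = alternating_cocluster(X, 3, 2)` returns one cluster label per data point and one per feature. Each round costs time proportional to the number of entries in `X` times the number of clusters, but the result depends on the random initialization and carries none of the accuracy or stability claims made for I²B.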
Keywords/Search Tags:Clustering, Large-scale data, High-dimensional data, Information bottleneck