High-dimensional Data Clustering Algorithms Based On Active Learning

Posted on: 2017-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: J B Feng
GTID: 2348330512976056
Subject: Computer application technology
Abstract/Summary:
With the development of information technology, the volume of data collected across many application areas is growing rapidly. Such data often have tens, hundreds, or even thousands of dimensions, and their prevalence makes high-dimensional data analysis an important research topic. Cluster analysis is an important task in data mining. However, because of the "curse of dimensionality", many traditional clustering algorithms cannot cluster high-dimensional data effectively. In recent years, researchers have found that high-dimensional spaces exhibit a Hubness phenomenon, and that this phenomenon can be used for clustering high-dimensional data. Its characteristic is that data points with higher Hubness values tend to lie closer to cluster centers (such points are called Hubs), and the tendency becomes more pronounced as the dimensionality increases. In this thesis, we study the influence of Hubness on several existing clustering and active learning algorithms. The three specific contributions are as follows.

Firstly, we study the K-Hub clustering algorithm, which is designed for clustering high-dimensional data. K-Hub selects its initial cluster centers randomly, and its weakness is that it is sensitive to this choice. To solve this problem, we propose a novel K-Hub clustering algorithm based on active learning. It uses an active learning strategy to learn K transitive closure sets and then selects the initial cluster centers from these sets, which ensures that the initial centers belong to different classes. The experimental results show that this method improves the accuracy of K-Hub.

Secondly, we study an algorithm that actively learns constraints based on ASC. The algorithm uses a function called ASC to select the instances whose constraints should be learned. It builds a K-NN graph on the dataset using a shared nearest neighbor distance metric, uses the edge weights to compute each instance's Ability to Separate between Clusters (ASC), and then decides which pair of instances to learn according to the ASC value. However, when multiple pairs of instances have the same ASC value, this strategy cannot decide which pair to choose. For this case, we use the Hubness values of the instances shared by those pairs to choose the instances whose constraints are learned. The experimental results show that the improved method learns more valuable constraints and achieves higher clustering accuracy.

Finally, building on the algorithms of the previous two chapters, we propose a two-stage active learning strategy based on Hubs. The strategy selects both the instances near class centers and the instances near class borders. According to the experimental results, this two-stage strategy improves the accuracy of the clustering algorithm.
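To make the notion of a Hubness value concrete, the following is a minimal sketch that assumes the common k-occurrence definition: the Hubness of a point is the number of other points that include it among their k nearest neighbors. The abstract does not say which definition or distance measure the thesis uses, so the Euclidean distance, the function name `hubness_scores`, and the parameter `k` are assumptions made only for illustration.

```python
import numpy as np

def hubness_scores(X, k):
    """Count how often each point appears in the k-NN lists of the other points.

    Assumption: Hubness is the k-occurrence count under Euclidean distance;
    the thesis may use a different definition or metric.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances (same ordering as true distances).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # A point must not count as its own neighbor.
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of every point.
    knn = np.argsort(dist, axis=1)[:, :k]
    # Each appearance in some neighbor list adds one to that point's score.
    return np.bincount(knn.ravel(), minlength=n)
```

Under this assumed definition, the points with the largest scores are the candidate Hubs; for instance, `np.argsort(-hubness_scores(X, k))` lists the point indices from the highest score to the lowest.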
Keywords/Search Tags:High-dimensional data, clustering, Hubness, active learning