
Research On Clustering Algorithms Based On Metric Learning For Complex Data

Posted on: 2020-06-06    Degree: Master    Type: Thesis
Country: China    Candidate: R N Liu    Full Text: PDF
GTID: 2428330578467721    Subject: Computer Science and Technology
Abstract/Summary:
Clustering analysis is an important direction in the field of data mining. Its main goal is to discover the implicit class structure in data and to divide the data into different clusters or classes, so that objects within the same class are highly similar while objects in different classes are not. Similarity measure functions based on metric learning theory are one of the key techniques of clustering analysis. With the development of measurement methods and clustering techniques, many researchers at home and abroad have proposed clustering algorithms based on different similarity measures. However, for massive high-dimensional complex data, the existing measures and clustering techniques consider only the spatial structure of the samples and ignore the correlation between samples, which results in low classification accuracy and high time cost. In this thesis, metric learning theory is introduced to improve the similarity measures used in clustering analysis, and the clustering algorithms are combined with dimension reduction algorithms to process complex data. The experimental results and analysis demonstrate the effectiveness of the proposed algorithms. The main work of this thesis is summarized as follows:

(1) The traditional affinity propagation (AP) clustering algorithm considers only the spatial structure of the samples when processing high-dimensional complex data, which may result in misclassification; it is also prone to local oscillation and failure to converge during the iterative process, which degrades the clustering result. To address these problems, a hybrid kernel function-based affinity propagation clustering method with locally linear embedding (LLE) was proposed for the classification of complex datasets. First, the LLE algorithm was introduced to reduce the dimension by mapping the high-dimensional dataset into a low-dimensional space. Second, a new global kernel with high generalization ability was defined, and a hybrid kernel function, formed as a linear combination of the proposed global kernel and the Gaussian kernel, was defined to further enhance the learning ability of the global kernel. This hybrid kernel was then used to define a similarity measure and to construct the similarity matrix of affinity propagation clustering, and a damping factor λ was introduced into the iterations to overcome oscillation and failure to converge. Finally, the improved affinity propagation clustering algorithm was run on several gene expression datasets and standard UCI datasets for comparison with other related algorithms. Experimental results validate that the proposed algorithm is efficient in terms of clustering accuracy and outperforms the approaches with which it is compared.
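As a rough illustration of how the components of (1) fit together, the Python sketch below builds a hybrid-kernel similarity matrix on LLE-reduced data and runs affinity propagation with a damping factor. The thesis's newly defined global kernel is not reproduced here; a polynomial kernel stands in for it, and the mixing weight w, the kernel parameters and the toy iris dataset are illustrative assumptions rather than the thesis's actual settings.

    from sklearn.manifold import LocallyLinearEmbedding
    from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
    from sklearn.cluster import AffinityPropagation
    from sklearn.datasets import load_iris

    X = load_iris().data  # stand-in for a high-dimensional complex dataset

    # Step 1: LLE maps the data into a low-dimensional space.
    Z = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)

    # Step 2: hybrid kernel = linear combination of a "global" kernel
    # (placeholder: polynomial) and the local Gaussian (RBF) kernel.
    w = 0.5  # assumed mixing weight
    K = w * polynomial_kernel(Z, degree=2) + (1 - w) * rbf_kernel(Z, gamma=1.0)

    # Step 3: use the kernel values as the precomputed similarity matrix of AP,
    # with damping to suppress oscillation during message passing.
    ap = AffinityPropagation(affinity="precomputed", damping=0.9, random_state=0)
    labels = ap.fit(K).labels_
    print(len(set(labels)), "clusters found")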
(2) To address the problems that the computation of the local density and the distance measure in density peak clustering (DPC) is overly simple and easily ignores the correlation and similarity between samples, and that the parameter settings strongly influence the clustering results, this thesis presented an adaptive DPC algorithm with Fisher linear discriminant for the classification of complex datasets. First, a kernel density estimation method was introduced to compute the local density of the sample points, and the Pearson correlation coefficient between samples was used as a weight to construct a weighted Euclidean distance, so that both the spatial structure and the correlation of the samples are taken into account. Second, a new density estimation entropy was proposed, and by minimizing this entropy the density estimation parameters were selected adaptively to optimize the cutoff distance, which eliminates the influence of manual settings. Third, an automatic selection strategy for cluster centers was designed to avoid the errors caused by choosing noise points as cluster centers and the uncertainty of manual selection. Finally, the Fisher linear discriminant algorithm was used to remove irrelevant information and reduce the dimension of high-dimensional complex data, and the adaptive DPC algorithm was evaluated on six synthetic datasets, thirteen UCI datasets and seven gene expression datasets against other related algorithms. Experimental results on these twenty-six datasets show that the proposed algorithm selects cluster centers accurately and significantly outperforms several state-of-the-art clustering approaches in terms of clustering accuracy and efficiency.

(3) To overcome the shortcomings that the traditional biclustering algorithm cannot accurately find overlapping biclusters and captures coherent fluctuation patterns poorly when dealing with high-dimensional complex data, an improved rough fuzzy biclustering algorithm based on a rough mean squared residue was proposed. For high-dimensional complex datasets, missing values were filled first, and a nonnegative matrix factorization method was used to reduce the dimension, eliminate redundant features and obtain an effective feature subset. Then, rough set theory was combined with a fuzzy biclustering algorithm to obtain biclusters of larger volume: the upper and lower approximations of rough sets were introduced to define a new rough mean squared residue, on which the improved rough fuzzy biclustering algorithm is built. Next, a comprehensive evaluation function and an approximation degree principle were introduced to delete or add rows and columns of the data matrix so as to obtain biclustering results of larger volume. Finally, simulation experiments were carried out on several high-dimensional complex datasets to verify the effectiveness of the rough fuzzy biclustering algorithm.
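To make the residue that drives the biclustering in (3) concrete, the sketch below computes the classical mean squared residue of a bicluster on a toy matrix. The rough mean squared residue defined in the thesis, which replaces the crisp row and column sets with rough-set lower and upper approximations, is not reproduced here, and the toy data and index sets are purely illustrative.

    import numpy as np

    def mean_squared_residue(A, rows, cols):
        # Mean squared residue of the bicluster A[rows, cols]:
        # MSR = mean over (i, j) of (a_ij - a_iJ - a_Ij + a_IJ)^2
        sub = A[np.ix_(rows, cols)]
        row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
        col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
        overall = sub.mean()                          # a_IJ
        residue = sub - row_means - col_means + overall
        return float((residue ** 2).mean())

    # Toy expression matrix: rows = genes, columns = conditions.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 8))
    print(mean_squared_residue(A, rows=[0, 2, 5, 7], cols=[1, 3, 4]))

A lower residue indicates a more coherent bicluster, which is why rows and columns are added or deleted to keep the residue small while growing the bicluster's volume.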
Keywords/Search Tags: Metric learning, similarity measure, affinity propagation clustering, density peak clustering, biclustering