A Study On Clustering Algorithms For Categorical Data With Applications

Posted on:2020-05-13

Degree:Master

Type:Thesis

Country:China

Candidate:K P Xu

GTID:2428330620456745

Subject:Computer application technology

Abstract/Summary:

As an important data mining method, cluster analysis is widely used in pattern recognition, Web search, image processing, and many other fields. So far, most clustering algorithms have focused on numerical data, yet real-world applications involve large amounts of categorical data, including both structured categorical data and unstructured categorical sequences. Because categorical attributes take discrete values, existing clustering algorithms for numerical data cannot be applied to them directly. The study of clustering algorithms for categorical data has therefore become an important issue, significant for both the theory and the application of data mining and cluster analysis. This thesis studies several important problems in categorical data clustering: a kernel subspace clustering algorithm for mining nonlinear relationships among categorical attributes, a clustering algorithm for categorical sequences, and a robust probabilistic framework for handling noisy data and imbalanced (non-uniform) cluster data in categorical sequences. The main research work of this thesis is as follows:

1. Most clustering algorithms for categorical data make the unrealistic assumption that attributes are independent of each other, and so ignore linear or nonlinear correlations between attributes. To address this, a kernel subspace clustering algorithm for categorical data is proposed. The algorithm introduces kernel functions, originally developed for numerical data, to project categorical data into a kernel space, and defines a similarity measure for categorical data in the kernel subspace. Based on this measure, a kernel subspace clustering objective function is derived, together with an optimization method for solving it. In the resulting algorithm, each attribute is assigned a weight measuring its degree of relevance to the clusters, which enables automatic feature selection during clustering in the kernel space. A cluster validity index is also defined to evaluate the categorical clusters. Experimental results on synthetic and real-world datasets demonstrate that the proposed method effectively identifies nonlinear relationships among attributes and improves the performance and efficiency of clustering.

2. A self-expression model is proposed for categorical sequences. Based on this model, categorical sequences are transformed into vectors of equal length, and a similarity measure between categorical sequences is defined. Experimental results show that the proposed algorithm not only improves clustering accuracy but also reduces the influence of noisy data on the clustering results.

3. A probabilistic framework for robust clustering of categorical sequence data is proposed. The framework consists of a self-expression model and a Gaussian mixture distribution model, and it can both reduce the interference of noisy data with the clustering results and cluster imbalanced data. On this basis, the robust clustering problem for categorical sequences is transformed into a soft subspace clustering problem. Within this framework, a k-means-type clustering objective function is defined and a robust clustering algorithm for categorical sequences is proposed. Experimental results show that the algorithm has clear advantages over current clustering methods on real-world datasets.

The above work enriches research on categorical data clustering. The first contribution has been further extended to medical diagnosis and to animal and plant analysis; the last two have been applied to speech recognition, biological information, and text mining. The work in this thesis therefore provides new technical support for practical applications of data mining and has considerable application value in data mining and knowledge discovery.
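
The attribute-weighting idea behind the first and third contributions (each attribute receives a weight reflecting its relevance to a cluster) can be illustrated with a small sketch. This is not the thesis's algorithm: it is a generic k-modes-type procedure working directly on the categorical values, with no kernel space and no mixture model. The exponential weighting rule, the parameter `gamma`, and all function names are assumptions made for illustration only.

```python
from collections import Counter
import math


def weighted_mismatch(x, mode, weights):
    """Weighted simple-matching dissimilarity between an object and a cluster mode."""
    return sum(w for xi, mi, w in zip(x, mode, weights) if xi != mi)


def weighted_kmodes(data, init_modes, gamma=0.5, n_iter=10):
    """Illustrative attribute-weighted k-modes-type clustering.

    data       : list of equal-length tuples of categorical values
    init_modes : initial cluster modes, one tuple per cluster
    gamma      : assumed smoothing parameter; smaller values sharpen the weights
    """
    k, d = len(init_modes), len(data[0])
    modes = [tuple(m) for m in init_modes]
    weights = [[1.0 / d] * d for _ in range(k)]  # start with uniform weights
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest mode under the cluster's own attribute weights.
        new_labels = [
            min(range(k), key=lambda c: weighted_mismatch(x, modes[c], weights[c]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if not members:
                continue  # keep the previous mode and weights for an empty cluster
            # Mode update: most frequent category of each attribute in the cluster.
            modes[c] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            # Weight update (assumed rule): attributes with fewer mismatches inside
            # the cluster get larger weights; weights are normalised to sum to 1.
            rates = [
                sum(x[j] != modes[c][j] for x in members) / len(members)
                for j in range(d)
            ]
            raw = [math.exp(-r / gamma) for r in rates]
            total = sum(raw)
            weights[c] = [v / total for v in raw]
    return labels, modes, weights


# Toy data: the first two attributes separate the groups, the third is noise.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "x", "r"),
    ("b", "y", "p"), ("b", "y", "q"), ("b", "y", "r"),
]
labels, modes, weights = weighted_kmodes(data, init_modes=[data[0], data[3]])
```

On this toy data the uninformative third attribute ends up with the smallest weight in each cluster, which is the sense in which attribute weighting acts as automatic feature selection.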

Keywords/Search Tags:

Cluster analysis, Categorical data, Nonlinear metric, Subspace clustering, Robust clustering

Related items

1	Studies On Clustering Algorithms For Categorical Data
2	Research And Implementation Of Clustering Method For High Dimensional Categorical Data
3	The Research On Clustering Algorithm For Categorical Data Using Quantum Mechanics
4	Research On Subspace Clustering Algorithm On High-dimensional Categorical Datasets
5	Study Of Algorithms For Clustering Categorical Data
6	Categorical Relation Graph Construction And Clustering Analysis For Categorical Data
7	Research On Robust Subspace Clustering Algorithm And Its Application
8	Research On Enhanced Soft Subspace Clustering Technology
9	Research On Clustering Analysis And Its Applications In Telecom
10	ESCHCD: Entropy-based Algorithm For Subspace Clustering With High Dimensional Categorical Datasets