Research On Clustering Algorithms For The Data With Multidimensional Mixed Attributes

Posted on:2014-02-27

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J C Ji

Full Text:PDF

GTID:1228330395496607

Subject:Computer application technology

Abstract/Summary:


Organizing objects into sensible groups based on their intrinsic characteristics or similarity is one of the most fundamental modes of learning and understanding. Cluster analysis is the study of approaches and algorithms that partition or allocate objects into sensible groups. With the rapid development of information technology and of data collection and storage devices, almost all aspects of human society produce and store large amounts of data, and the volume and variety of data continue to grow quickly. For example, businesses worldwide generate large volumes of data, including sales transactions and stock trading records; scientific and engineering practices generate data from remote sensing, process measurement, scientific experiments, engineering observations, and environmental testing; and social media such as blogs, podcasts, Wikipedia, forums, social networks, microblogs, and Twitter have become increasingly important data sources.

The availability and explosive growth of data have inspired the emergence and development of data mining, or knowledge discovery, which can automatically or conveniently extract knowledge from data. Cluster analysis is an important technique in data mining and knowledge discovery; its purpose is to explore the potential structure hidden in the data. The technique is widely used in customer segmentation, web search, privacy protection, bioinformatics, and other fields.

Traditional clustering algorithms are mainly designed for data objects with only numeric or only categorical attributes. A growing body of research suggests that existing data sets are mostly described by both numeric and categorical attributes. Since these two types of attributes differ greatly in the range, characteristics, and distribution of their values, many researchers believe that traditional clustering algorithms designed for numeric or categorical data may not be suitable for processing mixed-attribute data.
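To make the mismatch between the two attribute types concrete, the sketch below computes a k-prototypes-style dissimilarity between two mixed records: squared Euclidean distance over the numeric attributes plus a weighted count of categorical mismatches. This is a widely used baseline for mixed data and is not the measure proposed in the dissertation; the function name `mixed_dissimilarity`, the weight `gamma`, and the sample records are assumptions made purely for illustration.

```python
from typing import Sequence


def mixed_dissimilarity(
    num_a: Sequence[float],
    cat_a: Sequence[str],
    num_b: Sequence[float],
    cat_b: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """k-prototypes-style dissimilarity between two mixed records.

    Numeric part: squared Euclidean distance.
    Categorical part: number of mismatching attributes, scaled by gamma
    (an assumed, user-chosen weight that balances the two parts).
    """
    if len(num_a) != len(num_b) or len(cat_a) != len(cat_b):
        raise ValueError("records must have the same attributes")
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * categorical


# Hypothetical customer records: (age, income in thousands) and (city, plan).
a_num, a_cat = (30.0, 50.0), ("Paris", "basic")
b_num, b_cat = (31.0, 52.0), ("Rome", "premium")

# Same records, two different weights for the categorical part.
print(mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0))
print(mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=10.0))
```

The two calls differ only in `gamma`, yet they change which attribute type dominates the result. That sensitivity to scale and weighting is one reason a measure built for a single attribute type does not transfer directly to mixed data.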
Designing algorithms for data with both numeric and categorical attributes is therefore one of the most attractive research issues in cluster analysis. In this paper, we investigate this issue. Our research work mainly covers the following four aspects:

1) Based on the W-k-means framework, a new clustering algorithm (IWKM) is proposed in this paper. In the IWKM algorithm, the distribution centroid is first introduced to represent the center of a cluster with categorical attributes; the distribution centroid and the mean are then combined to represent the center of a cluster with mixed numeric and categorical data; and a new dissimilarity measure, which takes into account the influence of different attributes in the clustering process, is used to evaluate the distance between data objects and cluster centers. In addition, the IWKM algorithm uses the weighting strategy of the W-k-means framework to assess the influence of each attribute. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with that of traditional clustering algorithms.

2) A weighted fuzzy k-prototypes algorithm (WFK-prototypes) is proposed in this paper. In this algorithm, the ideas of fuzzy sets and fuzzy clustering are introduced to deal with the fuzzy nature of data objects; the fuzzy centroid is integrated with the mean to represent the center of a cluster with mixed numeric and categorical data, and this new representation can capture the distribution information of both numeric and categorical attribute values; and the co-occurrence of attribute values is used to calculate the impact of each attribute in the clustering process. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with that of traditional clustering algorithms.

3) An improved KH algorithm (IKH) is proposed in this paper to deal with the issue of clustering mixed data.
In fuzzy clustering designed for mixed data, every data object influences all clusters, no matter how far it is from a cluster center. By introducing the KH framework, the IKH algorithm avoids this deficiency. In the IKH algorithm, we first combine the mean and the fuzzy centroid to represent the center of a cluster with mixed attributes, and then use a new dissimilarity measure, which employs a new normalization factor, to assess the distance between data objects and cluster centers with mixed attributes. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with that of traditional clustering algorithms.

4) A new method for initializing cluster centers (DDCI) for mixed data is proposed in this paper. In partitional algorithms, the clustering result is strongly influenced by the initial positions of the cluster centers. Many works have addressed this issue for numeric or categorical data. However, as far as we know, all partitional algorithms designed for mixed data initialize the cluster centers randomly. Random initialization leads to unstable clustering outcomes, and the results cannot be reproduced. To deal with the initialization issue for mixed data, we propose the DDCI approach, which is based on the ideas of density and distance. In the DDCI approach, for data with mixed numeric and categorical attributes, we introduce the notion of density to evaluate the coherence of a data object within the data set, and then combine density and distance to select the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with that of traditional clustering algorithms.
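The idea of combining density and distance to choose initial centers can be sketched as follows. The abstract does not give the DDCI formulas, so everything here is an assumption for illustration: the mixed dissimilarity (squared Euclidean plus `gamma` times categorical mismatches), the density definition, the selection rule, and the names `dissimilarity` and `init_centers`. The point of the sketch is only that such a rule is deterministic, unlike random initialization.

```python
from typing import List, Sequence, Tuple

# A record is (numeric part, categorical part); numeric values are assumed
# to be pre-scaled to [0, 1].
Record = Tuple[Tuple[float, ...], Tuple[str, ...]]


def dissimilarity(a: Record, b: Record, gamma: float = 1.0) -> float:
    """Assumed mixed measure: squared Euclidean + gamma * categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(a[0], b[0]))
    mismatches = sum(1 for x, y in zip(a[1], b[1]) if x != y)
    return numeric + gamma * mismatches


def init_centers(data: Sequence[Record], k: int, gamma: float = 1.0) -> List[int]:
    """Deterministically pick k initial centers, returned as indices into data.

    Assumed rule, for illustration only (not the DDCI definition):
      density(i)   = 1 / (1 + mean dissimilarity from i to every other object)
      first center = the densest object
      next centers = the object maximising
                     density * dissimilarity to its nearest chosen center
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of objects")
    dist = [[dissimilarity(data[i], data[j], gamma) for j in range(n)] for i in range(n)]
    density = [1.0 / (1.0 + sum(row) / max(n - 1, 1)) for row in dist]

    centers = [max(range(n), key=lambda i: density[i])]
    while len(centers) < k:
        def score(i: int) -> float:
            # Favour objects that are both dense and far from chosen centers.
            return density[i] * min(dist[i][c] for c in centers)

        candidates = [i for i in range(n) if i not in centers]
        centers.append(max(candidates, key=score))
    return centers


# Hypothetical data: two tight groups that differ in both attribute types.
data: List[Record] = [
    ((0.10,), ("a",)),
    ((0.12,), ("a",)),
    ((0.11,), ("a",)),
    ((0.90,), ("b",)),
    ((0.88,), ("b",)),
]

print(init_centers(data, k=2))
```

Because no random numbers are involved, repeated calls on the same data return the same indices, which is the reproducibility property the abstract contrasts with random initialization. The pairwise dissimilarity matrix makes this sketch quadratic in the number of objects, so it is suited to small examples only.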

Keywords/Search Tags:

Cluster analysis, data mining, mixed data, cluster centers initialization


Related items

1	Study On Partitioning Clustering Algorithms Based On Mixed Data
2	Research On Advertisement Recommendation System Based On Data Mining
3	Research On Partitional Clustering Algorithms For Mixed Data
4	Based On The Application Of Cluster Analysis Of Water Pollution Monitoring System
5	Research And Application Of K-means Clustering Algorithm
6	Algorithms Implementation Of Determining The Number Of Clusters And Initial Cluster Centers For Mixed Data
7	Research On Partitioning Clustering Algorithms For Data With Mixed Numerical And Categorical Attributes
8	The Application Of Cluster Analysis Algorithm In HMIS
9	Cluster Sowntown And Application Study Based On Least Cluster Cell
10	Cluster Analysis In Data Mining And Its Control In Applied Research