Research On Clustering Algorithms For The Data With Multidimensional Mixed Attributes

Posted on: 2014-02-27 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J C Ji | Full Text: PDF
GTID: 1228330395496607 | Subject: Computer application technology
Abstract/Summary:
Organizing objects into sensible groups based on their intrinsic characteristics or similarity is one of the most fundamental modes of learning and understanding. Cluster analysis is the study of approaches and algorithms that partition or allocate objects into sensible groups. With the rapid development of information technology and of data collection and storage devices, almost all aspects of human society produce and store large amounts of data, and the volume and variety of data continue to grow fast. For example, businesses worldwide generate large volumes of data, including sales transactions and stock trading records; scientific and engineering practice generates data from remote sensing, process measurement, scientific experiments, engineering observations, and environmental testing; and social media such as blogs, podcasts, Wikipedia, forums, social networks, microblogs, and Twitter have become increasingly important data sources.

The availability and explosive growth of data have inspired the emergence and development of data mining, or knowledge discovery, which extracts knowledge from data automatically or conveniently. Cluster analysis is an important technique in data mining and knowledge discovery; its purpose is to explore the potential structure hidden in the data. It is widely used in customer segmentation, web search, privacy protection, bioinformatics, etc.

Traditional clustering algorithms are mainly designed for data objects with only numeric or only categorical attributes. A growing body of research suggests that existing data sets are mostly described by both numeric and categorical attributes. Since these two types of attributes differ greatly in the range, characteristics, and distribution of their values, many researchers believe that traditional clustering algorithms designed for numeric or categorical data may not be suitable for processing mixed-attribute data.
Designing algorithms for data with both numeric and categorical attributes is therefore one of the most attractive research issues in cluster analysis. In this paper, we investigate this research issue. Our research work mainly includes the following four aspects:

1) Based on the W-k-means framework, a new clustering algorithm (IWKM) is proposed in this paper. In the IWKM algorithm, the distribution centroid is first introduced to represent the center of a cluster with categorical attributes; the distribution centroid and the mean are then combined to represent the center of a cluster with mixed numeric and categorical data; and a new dissimilarity measure, which takes into account the influence of different attributes in the clustering process, is used to evaluate the distance between data objects and cluster centers. In addition, the IWKM algorithm uses the weighting strategy of the W-k-means framework to assess the influence of each attribute. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with traditional clustering algorithms.

2) A weighted fuzzy k-prototypes algorithm (WFK-prototypes) is proposed in this paper. In this algorithm, the ideas of fuzzy sets and fuzzy clustering are introduced to deal with the fuzzy nature of data objects; the fuzzy centroid is integrated with the mean to represent the center of a cluster with mixed numeric and categorical data, and this new representation can capture the distribution information of both numeric and categorical attribute values; and the co-occurrence of attribute values is used to calculate the impact of each attribute in the clustering process. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with traditional clustering algorithms.

3) An improved KH algorithm (IKH) is proposed to deal with the issue of clustering mixed data in this paper.
In fuzzy clustering algorithms designed for mixed data, every data object influences all clusters, no matter how far it is from a cluster center. By introducing the KH framework, the IKH algorithm avoids this deficiency. In the IKH algorithm, we first combine means and fuzzy centroids to represent the center of a cluster with mixed attributes, and then use a new dissimilarity measure, which employs a new normalization factor, to assess the distance between data objects and cluster centers with mixed attributes. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with traditional clustering algorithms.

4) A new method for initializing cluster centers (DDCI) for mixed data is proposed in this paper. In partitional algorithms, the clustering result is strongly influenced by the initial positions of the cluster centers. Many existing works deal with this issue for numeric or categorical data. However, as far as we know, all partitional algorithms designed for mixed data initialize the cluster centers randomly. Random initialization leads to unstable clustering outcomes, and the clustering results cannot be reproduced. To deal with the initialization issue for mixed data, we propose the DDCI approach, which is based on the ideas of density and distance. In DDCI, for data with mixed numeric and categorical attributes, we introduce the notion of density to evaluate the coherence of a data object within the data set, and then combine density and distance to select the initial cluster centers. The performance of the proposed method is demonstrated by a series of experiments on real-world datasets in comparison with traditional clustering algorithms.
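
The mixed cluster center described in aspects 1) to 3) combines a mean for numeric attributes with a value distribution for categorical attributes. The abstract does not give the formulas, so the following Python sketch is only an illustration under stated assumptions: the function names `mixed_centroid` and `dissimilarity`, the squared-difference term for numeric attributes, the "one minus relative frequency" term for categorical attributes, and the fixed attribute weights are all hypothetical choices, not the definitions used in the dissertation.

```python
from collections import Counter


def mixed_centroid(objects, numeric_idx, categorical_idx):
    """Summarize one cluster as (means, distributions).

    Assumption: numeric attributes are represented by their mean, and each
    categorical attribute by the relative frequency of its values in the
    cluster (a simple stand-in for a "distribution centroid").
    """
    n = len(objects)
    means = {j: sum(obj[j] for obj in objects) / n for j in numeric_idx}
    distributions = {}
    for j in categorical_idx:
        counts = Counter(obj[j] for obj in objects)
        distributions[j] = {value: c / n for value, c in counts.items()}
    return means, distributions


def dissimilarity(obj, centroid, weights):
    """Weighted distance between one object and a mixed centroid.

    Assumption: squared difference on numeric attributes, and
    1 - (relative frequency of the object's category) on categorical ones.
    The weights are supplied by the caller; they are not learned here.
    """
    means, distributions = centroid
    total = 0.0
    for j, mean in means.items():
        total += weights[j] * (obj[j] - mean) ** 2
    for j, dist in distributions.items():
        total += weights[j] * (1.0 - dist.get(obj[j], 0.0))
    return total


if __name__ == "__main__":
    # Toy cluster: attribute 0 is numeric, attribute 1 is categorical.
    cluster = [(1.0, "red"), (3.0, "red"), (2.0, "blue")]
    centroid = mixed_centroid(cluster, numeric_idx=[0], categorical_idx=[1])
    weights = {0: 0.5, 1: 0.5}  # hypothetical, equal attribute weights
    print(centroid)
    print(dissimilarity((2.0, "red"), centroid, weights))
    print(dissimilarity((2.0, "green"), centroid, weights))
```

In this sketch, an object whose category is frequent in the cluster is closer to the center than one whose category never occurs there. In an attribute-weighting scheme such as the one the abstract attributes to the W-k-means framework, the weights would be updated during clustering rather than fixed in advance.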
Keywords/Search Tags: Cluster analysis, data mining, mixed data, cluster center initialization