A K-modes Clustering Algorithm Based on Dynamic Weight

Posted on: 2021-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: D D Liu
GTID: 2428330611950558
Subject: Computational Mathematics
Abstract/Summary:
With the rapid development of information technology, the scale of data keeps increasing and data types are becoming more complex, which places higher requirements on the collection and processing of these massive data. Against this background, data mining technology emerged and is now widely used in various industries. Cluster analysis, one of the most important branches of data mining, is a technique that partitions a data set according to a similarity metric. As data types become more diverse and complex, cluster analysis must also be able to handle many different types of data. Recent studies have achieved many results in the clustering of numerical data, but real databases and large data sets contain not only numerical data but also a large amount of categorical data, such as biological information data and epidemic prevention and control data. Since categorical data lack the inherent geometric properties of numerical data, numerical clustering algorithms are not fully suitable for categorical data. Further study of clustering methods for categorical data is therefore necessary.

This dissertation introduces the concept of cluster analysis in detail and presents the data structures, similarity metrics, and objective functions commonly used in it. It then analyzes existing and improved k-modes algorithms. The analysis shows that these algorithms still have deficiencies, such as the definition of the dissimilarity measure and the selection of initial centers. In response to these issues, the following work has been carried out in this dissertation:

(1) An interdependent weighted distance measure is proposed. A dependency correlation matrix between attributes is established from co-occurrence information, so that the interdependence between attributes is taken into account. Two improvements are made in calculating the degree of difference between the attribute values of objects. First, 0-1 matching is used to reflect whether two values are the same or different. Second, the dissimilarity is weighted using the interdependence correlation matrix between attributes, which reflects the degree of influence of the other attributes on the given attribute. The distance between two attribute values therefore consists of an internal distance and an external distance: the internal distance is calculated by 0-1 matching, and the external distance by weighted interdependence similarity.

(2) An initial center selection method based on dynamically weighted density and distance is proposed. First, the weights of density and distance are adjusted dynamically according to the distance of the candidate points. The weighting coefficient of distance increases with distance, and the weighting coefficient of density decreases with distance. This keeps each candidate initial center as far as possible from the centers already selected without losing clusters in data-dense areas, so the selected centers are distributed more widely. Second, the density radius is adjusted dynamically. The radius decreases with distance, which avoids selecting points that are far from the initial centers and have high density but whose values on each categorical attribute are sparse. This makes the selection more discriminating. Third, the candidate set is further screened by the outlier factor of the data points: an improved distance-based outlier detection technique removes points with a large outlier factor from the candidate centers, which ensures that suitable initial centers are selected.

Computational results show that the improved k-modes algorithm, based on the distance measure and the initial center selection method proposed in this dissertation, achieves higher accuracy and precision. Compared with the traditional k-modes algorithm and other improved k-modes algorithms, the proposed approach reduces the sensitivity to initial center selection, which demonstrates its effectiveness.
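
The abstract does not give the formulas behind the internal and external distances, so the following Python sketch only illustrates the general idea under stated assumptions. The internal distance is plain 0-1 matching, as described. The external distance is an assumption: it compares how two values of one attribute co-occur with the values of every other attribute (half the L1 difference between the conditional distributions) and weights each comparison with a dependency matrix `weights`, which here is simply supplied by the caller. All function names are hypothetical and are not taken from the dissertation.

```python
from collections import Counter


def conditional_dist(data, j, value, k):
    """Empirical distribution of attribute k over rows where attribute j == value."""
    column = [row[k] for row in data if row[j] == value]
    total = len(column)
    if total == 0:
        return {}
    return {v: c / total for v, c in Counter(column).items()}


def internal_distance(a, b):
    """0-1 matching: 0 if the two values agree, 1 otherwise."""
    return 0.0 if a == b else 1.0


def external_distance(data, j, a, b, weights):
    """Assumed form: weighted difference of co-occurrence patterns.

    weights[j][k] is a caller-supplied (hypothetical) dependency weight
    of attribute j on attribute k.
    """
    if a == b:
        return 0.0
    total = 0.0
    for k in range(len(data[0])):
        if k == j:
            continue
        pa = conditional_dist(data, j, a, k)
        pb = conditional_dist(data, j, b, k)
        # Half the L1 difference between the two conditional distributions.
        diff = 0.5 * sum(abs(pa.get(v, 0.0) - pb.get(v, 0.0))
                         for v in set(pa) | set(pb))
        total += weights[j][k] * diff
    return total


def dissimilarity(data, x, y, weights):
    """Sum of internal and external distances over all attributes."""
    return sum(
        internal_distance(x[j], y[j])
        + external_distance(data, j, x[j], y[j], weights)
        for j in range(len(x))
    )
```

With this sketch, two objects that differ on an attribute are pushed further apart when the differing values also co-occur with different values of the other attributes, and the external term vanishes when the attributes are empirically independent.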
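
The trade-off between distance and density in the initial center selection can be sketched in Python as follows. This is a simplified illustration, not the dissertation's method: the scoring formula, the use of simple matching (Hamming) distance, the fixed density radius, and all names are assumptions, and the dynamic radius and the outlier screening step are omitted.

```python
def hamming(x, y):
    """Number of attributes on which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))


def density(data, p, radius):
    """Fraction of objects within `radius` mismatches of p (assumed definition)."""
    return sum(hamming(p, q) <= radius for q in data) / len(data)


def select_initial_centers(data, k, radius=1):
    """Pick k initial centers by a distance/density score with dynamic weights."""
    m = len(data[0])
    dens = [density(data, p, radius) for p in data]
    # Start from the densest object (the first one wins ties).
    chosen = [max(range(len(data)), key=lambda i: dens[i])]
    while len(chosen) < k:
        best, best_score = None, -1.0
        for i, p in enumerate(data):
            if i in chosen:
                continue
            # Normalized distance to the nearest already chosen center, in [0, 1].
            d = min(hamming(p, data[c]) for c in chosen) / m
            if d == 0:
                continue  # identical to an existing center
            w_dist = d        # assumed: distance weight grows with distance
            w_dens = 1.0 - d  # assumed: density weight shrinks with distance
            score = w_dist * d + w_dens * dens[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            break  # fewer than k distinct objects
        chosen.append(best)
    return [data[i] for i in chosen]
```

Under these assumptions, a candidate far from every chosen center is ranked mainly by its distance, while a nearby candidate must lie in a dense region to score well.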
Keywords/Search Tags: Clustering analysis, k-modes algorithm, Categorical data, Dissimilarity measure, Dynamic weight, Initial center selection