
Research Of Clustering Algorithms For Mixed Data Based On Attribute Weighting And Similarity Measuring

Posted on: 2011-04-28
Degree: Master
Type: Thesis
Country: China
Candidate: J Wan
Full Text: PDF
GTID: 2178330338476270
Subject: Computer application technology
Abstract/Summary:
Clustering analysis is an important branch of data mining research. Its task is to efficiently partition large data sets into a series of clusters such that objects belonging to the same cluster are similar. By interpreting the clusters effectively, we can often identify interesting groups. The K-Means algorithm is the most frequently used clustering method and is attractive for its efficiency in processing large data sets. However, its use is limited to numeric data. Building on K-Means, the K-Modes and K-Prototypes algorithms were proposed to handle categorical and mixed data, respectively. Because their similarity measures and weight calculations are poorly justified, however, clustering precision cannot be guaranteed.

A concept hierarchy (CH) tree is a hierarchical, structured semantic description of an attribute's values and can be used to measure the similarity of categorical attributes. Traditional measures first encode the tree and then calculate a conceptual distance, which reflects the difference between attribute values only to a certain extent. Based on the CH tree, this paper abandons the traditional encoding methods and uses the tree structure directly. The conceptual distance is replaced by the distance between tree nodes, which avoids the information loss caused by encoding. The new measure is not only intuitively reasonable but also satisfies the properties of a metric space.

ReliefF is a common feature selection algorithm with high execution efficiency. Based on the idea of overall consideration proposed by A. Ahamd, the improvement focuses on the measurement of attribute differences, which lets the ReliefF algorithm evaluate the importance of each attribute more accurately and assign it a corresponding value. Meanwhile, by combining overall consideration with graph clustering theory, this paper transforms the information system of the data set into a weighted graph and calculates the similarity of attributes from the degree of connection between the nodes of the graph. This approach preserves the rationality of overall consideration while reducing the computational complexity.

Mixed data contains two completely different types of attributes, which makes clustering difficult. This paper mainly discusses two problems frequently encountered in mixed data clustering: the level of attribute importance and the distance contribution of each attribute. First, numeric attributes are discretized so that attribute importance can be measured and evaluated on the entire data set. Then the new similarity measure is applied, and finally experiments are run on 3 types of data sets. Comparisons with traditional clustering methods illustrate the efficiency and effectiveness of the new method.
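
The idea of measuring categorical similarity by the distance between tree nodes can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formula, so this example assumes the simplest choice: the number of edges on the path between two nodes in the concept hierarchy. The hierarchy `PARENT` and the function names are hypothetical and exist only for illustration.

```python
# Hypothetical concept hierarchy for a "colour" attribute,
# stored as child -> parent (the root maps to None).
PARENT = {
    "any": None,
    "warm": "any",
    "cool": "any",
    "red": "warm",
    "orange": "warm",
    "blue": "cool",
}


def path_to_root(node, parent=PARENT):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def tree_distance(a, b, parent=PARENT):
    """Count the edges on the path between a and b in the hierarchy.

    Assumption: plain edge counting through the lowest common ancestor.
    The thesis may weight edges or normalise the result differently.
    """
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            # n is the lowest common ancestor of a and b.
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")
```

Under this assumption, sibling values such as "red" and "orange" are closer to each other than "red" and "blue", and because path length in a tree is non-negative, symmetric, zero only for identical nodes, and obeys the triangle inequality, the measure behaves as a metric, in line with the property the abstract claims.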
Keywords/Search Tags: clustering analysis, CH tree, feature selection, similarity measurement, mixed data