Research On Hierarchical Clustering Algorithm Based On Silhouette

Posted on: 2011-06-01
Degree: Master
Type: Thesis
Country: China
Candidate: D M Zhang
Full Text: PDF
GTID: 2178360302494501
Subject: Computer software and theory
Abstract/Summary:
A survey of clustering research in China and abroad shows that existing clustering algorithms still have several problems. Traditional hierarchical clustering algorithms require the termination parameters to be fixed in advance, and determining those parameters has high time complexity. Most hierarchical clustering algorithms do not make full use of existing background knowledge, so the quality of the clustering result suffers. In addition, few hierarchical clustering algorithms have been applied to the analysis of sequence data. To address these problems, this thesis studies hierarchical clustering algorithms based on the silhouette. Solving them is significant for the life sciences, medicine, social science, geographical science and other fields.

Firstly, a hierarchical clustering algorithm based on silhouette is proposed. In the algorithm, the number of clusters is determined by incrementally plotting the curve of the mean improved silhouette of the dataset. In the subsequent agglomerative hierarchical clustering phase, entropy is introduced as a new similarity measure. Outlier clusters are identified by calculating the weighted distance between clusters.

Secondly, a hierarchical clustering algorithm based on silhouette and constraints is proposed. Existing pairwise instance-level constraints are incorporated into the algorithm and used to update the cohesion matrix. A penalty factor is introduced to handle violations of the must-link and cannot-link constraints.

Finally, a sequence hierarchical clustering algorithm based on silhouette is proposed for software security analysis. In this algorithm, a fault feature matrix is defined on the basis of existing sequence patterns to reflect the relation between each fault feature and the corresponding row vector. The clustering of sequences can thus be transformed into the clustering of row vectors. Clustering the existing fault sequences reduces the matching scale of software fault feature analysis.
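
As an illustration of the first idea (choosing the number of clusters from a silhouette curve), the following is a minimal sketch and not the algorithm of the thesis. It assumes the standard silhouette coefficient and average-linkage agglomerative clustering with Euclidean distance, in place of the improved silhouette and the entropy-based similarity, which the abstract does not specify. The function names (mean_silhouette, agglomerate, choose_k) and the sample points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_silhouette(points, labels):
    """Mean standard silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        return 0.0  # silhouette is undefined for a single cluster
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # s(i) = 0 by convention for singletons
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def agglomerate(points, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(points))]

    def linkage(pair):
        ca, cb = clusters[pair[0]], clusters[pair[1]]
        dists = sum(euclidean(points[i], points[j]) for i in ca for j in cb)
        return dists / (len(ca) * len(cb))

    while len(clusters) > k:
        # merge the pair of clusters with the smallest average distance
        x, y = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[x].extend(clusters[y])
        del clusters[y]  # safe: combinations guarantees x < y
    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


def choose_k(points, k_max):
    """Return (best_k, scores), where scores maps k to the mean silhouette."""
    scores = {
        k: mean_silhouette(points, agglomerate(points, k))
        for k in range(2, k_max + 1)
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical toy data: three well-separated groups of three points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
best_k, scores = choose_k(points, k_max=5)
print(best_k, scores)
```

The sketch recomputes the clustering from scratch for every candidate k, which is simple but wasteful. An incremental variant would score each level of a single merge sequence, which is closer in spirit to drawing the curve incrementally as the abstract describes.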
Keywords/Search Tags:Hierarchical clustering, Silhouette, K-means, Entropy, Constraint, Sequence