| As an important data mining technique, cluster analysis has gradually become an interdisciplinary, cross-domain method of data analysis. Traditional clustering is unsupervised. Semi-supervised clustering integrates a small amount of prior knowledge into the clustering process to obtain better clustering results. This paper takes semi-supervised clustering as its theme and evaluates and analyzes the effectiveness of clustering. The main research contents are as follows: (1) A large number of semi-supervised clustering algorithms of different types have been proposed, and evaluating their effectiveness has become an important research topic in semi-supervised learning. However, existing evaluation methods mainly use unsupervised clustering results as the baseline for judging semi-supervised results, without fully considering the impact of the supervised information on the clustering results. To evaluate the effectiveness of semi-supervised clustering results, we propose two baselines, based on similarity and on random merging, that address whether the supervised information participates in the clustering process and whether it improves the clustering results. Both baselines rest on the basic assumption that the clustering result does not violate the supervised information, and they assess the quality of the clustering result on that basis. In the experimental analysis, 8 semi-supervised clustering algorithms were evaluated on 8 data sets, with the corresponding unsupervised clustering algorithms compared as the baseline, demonstrating the effectiveness and feasibility of the new baselines. (2) Because the performance of semi-supervised clustering depends on the supervised information, different supervised information often yields different clustering results. It is generally believed that the result of semi-supervised clustering always improves as the amount of supervised
information. However, experiments show that this belief does not always hold. Therefore, to better study the influence of constraint-set quality on semi-supervised clustering algorithms, the paper proposes a method based on core edges for evaluating the quality of a constraint set. Following the properties of pairwise constraints, the method generates the must-link constraint closure and the cannot-link constraint closure from the constraint set and defines a metric based on core edges. The paper argues that the more core edges a constraint set has, the better its quality. The quality of the constraint sets of 5 classic semi-supervised clustering algorithms is evaluated on 8 benchmark data sets. The experimental results show that the proposed method can effectively evaluate the quality of the constraint set of a semi-supervised clustering algorithm. (3) To address the lack of benchmarks for evaluating the quality of semi-supervised clustering algorithms, a semi-supervised clustering effectiveness evaluation system is proposed. The role of each module of the system is described in detail; the system displays basic information about the data sets and the algorithms, and presents the evaluation results of semi-supervised clustering as scatter plots and histograms. In summary, to address the problem of evaluating the effectiveness of semi-supervised clustering algorithms, the paper studies both the clustering results and the quality of the constraint set. The proposed methods help guide further research on semi-supervised clustering and provide technical support for semi-supervised clustering evaluation. |
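
The abstract mentions generating must-link and cannot-link constraint closures and checking that a clustering result does not violate the supervised information. The sketch below illustrates one common way to do this, assuming the standard reading of pairwise constraints: must-link is transitive, and a cannot-link between two points extends to every pair drawn from their must-link groups. The function names `constraint_closures` and `violates` are hypothetical, and the core-edge metric itself is not shown because the abstract does not define it.

```python
from itertools import combinations


def constraint_closures(n_points, must_link, cannot_link):
    """Return (ml_closure, cl_closure) as sets of (i, j) pairs with i < j.

    Assumption: standard pairwise-constraint semantics, not necessarily
    the exact procedure used in the paper.
    """
    parent = list(range(n_points))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every must-link pair into one connected component.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # Collect the members of each component in ascending index order.
    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)

    # Must-link closure: every pair inside the same component.
    ml_closure = set()
    for members in groups.values():
        ml_closure.update(combinations(members, 2))

    # Cannot-link closure: every cross pair between two separated components.
    cl_closure = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError(f"inconsistent constraints on points {a} and {b}")
        for x in groups[ra]:
            for y in groups[rb]:
                cl_closure.add((min(x, y), max(x, y)))

    return ml_closure, cl_closure


def violates(labels, ml_closure, cl_closure):
    """True if the cluster labels break any constraint in either closure."""
    broken_ml = any(labels[a] != labels[b] for a, b in ml_closure)
    broken_cl = any(labels[a] == labels[b] for a, b in cl_closure)
    return broken_ml or broken_cl


# Toy example with 5 points: 0, 1, 2 must be together; 2 and 3 must be apart.
ml, cl = constraint_closures(5, [(0, 1), (1, 2)], [(2, 3)])
consistent = violates([0, 0, 0, 1, 1], ml, cl)
inconsistent = violates([0, 0, 1, 1, 0], ml, cl)
```

In this toy example the first labeling respects both closures, while the second splits points 0 and 2 and therefore breaks a derived must-link constraint that was never stated explicitly.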