| Clustering is an unsupervised learning method that can group similar samples into the same cluster without requiring manually labeled samples.Clustering can be used to discover patterns and structures in data,to naturally group data,and to compress data.Clustering is both a tool for data analysis and data preprocessing.Clustering algorithms are widely used in many fields,such as image processing,bioinformatics,social network analysis,etc.K-means is one of the top ten data mining algorithms and is a wellknown partition-based clustering algorithm.The algorithm is simple and effective,easy to scale,and widely studied and used in various fields.However,the k-means algorithm is vulnerable to the initial selection of cluster centers,which can affect the clustering results and even lead to a local optimum.Therefore,the improvement of k-means initialization algorithm has received the attention of many researchers.In the era of big data,massive data often contains a lot of noise and outliers,so the k-means algorithm is vulnerable to these outliers,and the study of robust k-means problems has emerged.In addition,not all datasets are suitable for using Euclidean distance,and appropriate distance metrics should be selected according to specific problems.This article focuses on the initialization problem of k-means.Regarding the robust k-means problem with outlier points,based on the idea of introducing probability into the initial selection of cluster centers in kmeans++,a seeding algorithm for μ-similar Bregman divergence k-means problem with penalties is proposed based on Bregman divergence,especially μ-similar Bregman divergence.Firstly,this algorithm allows some data points not to be clustered but to pay a penalty,reducing the influence of outliers on the initial cluster centers in the k-means algorithm.Secondly,this algorithm extends the k-means problem from Euclidean distance to more general Bregman divergence and proposes an initialization algorithm for μ-similar Bregman divergence k-means problem.Finally,theoretical analysis is conducted on this algorithm,proving that the performance guarantee of the algorithm proposed in this article only involves the ratio of the maximum penalty value to the minimum penalty value and μ.When data points have the same penalties,the ratio of the maximum penalty value to the minimum penalty value becomes 1.When μis 1,it is the same as the approximation ratio of Euclidean distance.When the punishment is the same,the initialization algorithm proposed in this article has the same performance guarantee as the known k-means++algorithm for the k-means problem.Therefore,the algorithm in this article extends the k-means problem with penalties in Euclidean distance.To verify the effectiveness and applicability of this algorithm,numerical experiments are carried out in this article.Firstly,different Bregman divergences are used to analyze clustering on different types of datasets to verify the influence of different Bregman divergence metrics on clustering performance on different datasets.Secondly,the initialization algorithm of k-means with penalties is used to conduct experiments on datasets with outliers,compared with the k-means++ initialization algorithm,and cluster data analysis is carried out to verify the robustness of the algorithm proposed in this article.Finally,the algorithm proposed in this article is applied to network intrusion detection for application experiments.The experiments show that the algorithm proposed in this article can improve the detection rate of intrusion detection algorithms based on k-means clustering,and the Bregman divergence metric makes it more widely applicable,verifying the practicality of the algorithm proposed in this article. |