
Three-Way Clustering Research Based On Natural Nearest Neighbors

Posted on: 2024-04-09
Degree: Master
Type: Thesis
Country: China
Candidate: T F Wu
Full Text: PDF
GTID: 2568307154499344
Subject: Computer Science and Technology
Abstract/Summary:
Clustering is a widely used unsupervised learning method and one of the fundamental techniques in fields such as data mining and machine learning. Its purpose is to group data objects by similarity, so that similarity within a cluster is maximized and similarity between different clusters is minimized. As an important branch of data mining, cluster analysis plays a crucial role in discovering the internal structure of data and extracting information from it. Traditional hard clustering algorithms represent each cluster by a single set, so each sample can belong to only one cluster, which significantly limits how fully the internal structure of a data set can be represented. In contrast, soft clustering algorithms relax the constraints on cluster boundaries: a sample may be assigned to multiple clusters, and the intersection of two clusters need not be empty. This lets them handle overlapping clusters, outliers, and uncertain objects, so soft clustering algorithms have broader prospects for revealing the internal structure of data sets.

As a soft clustering approach, three-way clustering introduces three-way decision theory into cluster analysis. Unlike in traditional clustering methods, a cluster is no longer a single set but is composed of two sets, the core region and the fringe region. The core region contains the typical objects of the cluster, which definitely belong to it, while the fringe region contains marginal objects that may or may not belong to it. This three-way representation can handle both traditional hard clustering tasks and soft clustering tasks. By partitioning off the fringe region of each cluster, three-way clustering addresses the problem of information uncertainty in traditional clustering methods and reduces the decision risk that this uncertainty brings.

In this thesis, we integrate the idea of three-way clustering into the density peaks clustering algorithm and the ensemble clustering algorithm, and improve both by using natural nearest neighbors. The contributions are as follows:

(Ⅰ) To address the difficulty of forming a clear-cut cluster boundary, three-way clustering methods search for a new type of cluster structure characterised by a pair of regions: a core region of tightly connected objects and a fringe region of relatively loosely connected objects. The density peaks clustering (DPC) algorithm is non-iterative and does not require a predetermined number of clusters. It uses the local density and local distance of each object to construct a decision graph and selects the cluster centers from that graph. Once the cluster centers are determined, each remaining unallocated object is assigned to the cluster of its nearest neighbor with higher density. Taking advantage of both classes of methods, we propose a new three-way adaptive density peaks clustering (3W-ADPC) method. Based on two improved definitions of local density and local distance, the method adaptively selects the most appropriate neighbors (i.e., the natural nearest neighbors) of each sample and does not need a cut-off distance threshold parameter. In other words, 3W-ADPC is a parameter-free three-way clustering algorithm. Experimental results show that 3W-ADPC not only explains the clustering structure well but also performs well.

(Ⅱ) The complexity of data types and distributions increases the uncertainty in the relationships between samples, which makes it challenging to mine the potential cluster structure of the data effectively. Ensemble clustering aims to obtain a unified cluster division by fusing multiple different base clustering results. This thesis proposes a three-way ensemble clustering algorithm based on sample perturbation theory to solve the problem of inaccurate decisions caused by inaccurate information or insufficient data. The algorithm first uses the natural nearest neighbor algorithm to generate two perturbed data sets, randomly extracts feature subsets of the samples, and applies a traditional clustering algorithm to obtain different base clusterings. The stability of each sample is obtained from the co-association matrix and a determinacy function, and the samples are then divided into a stable region and an unstable region according to a threshold on that stability. The stable region consists of high-stability samples and is divided into the core regions of the clusters using the K-means algorithm. The unstable region consists of low-stability samples, which are assigned to the fringe regions of the clusters. Together these form a three-way clustering result. Experimental results on data sets from the UCI Machine Learning Repository show that the proposed algorithm obtains better clustering results than other clustering ensemble algorithms and can effectively reveal the clustering structure.
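The stability step of contribution (Ⅱ) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the determinacy function, so the code assumes f(c) = |2c - 1| (values near 0 or 1 in the co-association matrix count as certain, values near 0.5 as uncertain). The function names, the toy base clusterings, and the threshold of 0.5 are all hypothetical.

```python
def co_association(base_labelings):
    """Fraction of base clusterings that place samples i and j in the same cluster."""
    n = len(base_labelings[0])
    m = len(base_labelings)
    return [
        [sum(1 for labels in base_labelings if labels[i] == labels[j]) / m
         for j in range(n)]
        for i in range(n)
    ]


def stability(co):
    """Average certainty of each sample's pairwise relations.

    ASSUMPTION: the determinacy function is f(c) = |2c - 1|; the abstract
    does not state the form actually used in the thesis.
    """
    n = len(co)
    return [
        sum(abs(2 * co[i][j] - 1) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def split_by_stability(scores, threshold):
    """Indices of stable (score >= threshold) and unstable samples."""
    stable = [i for i, s in enumerate(scores) if s >= threshold]
    unstable = [i for i, s in enumerate(scores) if s < threshold]
    return stable, unstable


# Three hypothetical base clusterings of six samples; sample 2 changes sides.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
co = co_association(base)
scores = stability(co)
stable, unstable = split_by_stability(scores, threshold=0.5)
print("stable:", stable, "unstable:", unstable)
```

The sketch stops at the stable/unstable split. In the pipeline described above, the stable samples would then be clustered with K-means to form the core regions, and the unstable samples would be assigned to fringe regions.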
Keywords/Search Tags:DPC, Natural nearest neighbor, Three-way decision, Three-way clustering, Ensemble clustering
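The natural nearest neighbor idea listed among the keywords can be sketched as follows. This follows one common formulation of the search from the literature, not necessarily the variant used in the thesis: the neighborhood size r grows until every point has been chosen as a neighbor by some other point, or the number of never-chosen points stops changing. Two points are then natural neighbors if each lies in the other's r-neighborhood. The function name and the toy data are hypothetical.

```python
import math


def natural_neighbors(points):
    """Return (r, natural) where natural[i] is the set of mutual r-neighbors of i."""
    n = len(points)
    # For every point, the indices of all other points by increasing distance.
    order = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))
        for i in range(n)
    ]
    knn = [set() for _ in range(n)]   # neighbors collected so far
    reverse_count = [0] * n           # how often each point was chosen
    prev_orphans = None
    r = 0
    while r < n - 1:
        for i in range(n):
            j = order[i][r]           # the (r+1)-th nearest neighbor of i
            knn[i].add(j)
            reverse_count[j] += 1
        r += 1
        orphans = sum(1 for c in reverse_count if c == 0)
        # Stop when nobody is left unchosen or the orphan count has stalled.
        if orphans == 0 or orphans == prev_orphans:
            break
        prev_orphans = orphans
    natural = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return r, natural


# Two small, well-separated groups of hypothetical 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
r, natural = natural_neighbors(points)
print("neighborhood size:", r)
print("natural neighbors of point 0:", sorted(natural[0]))
```

Because r is determined by the data rather than supplied by the user, a method built on this search needs no neighborhood-size or cut-off distance parameter, which is the property the abstract relies on when it calls 3W-ADPC parameter-free.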