
A Cluster Validity Index Based On Binary Tree Nearest Neighborhood

Posted on: 2024-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: C K Hou
GTID: 2568307064480974
Subject: Computational Mathematics
Abstract/Summary:
Data mining is the process of discovering potential knowledge in massive data, and cluster analysis is one of its important methods. Cluster analysis is an unsupervised machine learning method commonly used in statistics, natural language processing, and image processing. Its main goal is to assign samples with the same properties to the same cluster and samples with different properties to different clusters. Because cluster analysis is unsupervised, determining the number of clusters in a dataset is particularly important: different numbers of clusters may lead to completely different clustering results. In recent years, researchers have proposed a series of cluster validity indexes that evaluate clustering results in order to determine the number of clusters. However, these indexes are not suitable for datasets with complex manifold structure or datasets containing many noise points. To address these problems, this study proposes an adaptive method for finding a neighborhood structure and applies it to build a cluster validity index. The specific work is as follows:

(1) We define a new nearest neighborhood structure, called the binary tree nearest neighborhood. It is a binary tree structure built by repeatedly identifying the relationship between a sample and its two nearest samples. We then introduce an algorithm for finding the binary tree nearest neighborhoods in a dataset. This algorithm avoids the sensitivity to hyperparameters of the traditional nearest neighbor algorithm and adaptively generates binary tree nearest neighborhoods according to the distribution of samples in the dataset. It can also serve as a general preprocessing step for datasets.

(2) We propose a novel cluster validity index based on binary tree nearest neighborhoods, called the BTCV index. Because existing cluster validity indexes cannot evaluate a variety of complex manifold clusters and are easily disturbed by noise points, we use the distance between binary tree nearest neighborhoods, instead of the distance between samples, to evaluate clustering results. The compactness of each cluster and its separation from the other clusters determine that cluster’s quality, and the average quality over all clusters is taken as the BTCV index of the dataset. Comparing the BTCV index with other indexes on the task of finding the number of clusters in artificial and real datasets, the experiments show that the BTCV index not only resists interference from noise but also finds the optimal number of clusters in datasets containing more types of manifold clusters.

(3) We propose a minimum spanning tree clustering algorithm based on binary tree nearest neighborhoods that automatically obtains the optimal number of clusters for a dataset. We call it the MST-BTCV algorithm. Although the minimum spanning tree clustering algorithm is suitable for datasets whose clusters have different complex shapes, it requires the number of clusters to be set manually and is easily affected by noise. To solve these problems, we combine the BTCV index with the minimum spanning tree clustering algorithm to obtain the MST-BTCV algorithm. Experimental results on artificial and real datasets show that the MST-BTCV algorithm can both accurately partition multiple types of datasets and obtain the optimal number of clusters in different datasets.
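The idea of letting a validity index choose the number of clusters for a minimum spanning tree clustering can be illustrated with a minimal sketch. The abstract does not give the BTCV formula or the binary tree neighborhood construction, so the sketch substitutes the standard silhouette coefficient as a stand-in validity index and works on plain Euclidean distances between samples. The function names (`mst_edges`, `mst_clusters`, `silhouette`, `best_k`) and the toy data are assumptions made for illustration; this is not the MST-BTCV algorithm itself.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known distance from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_clusters(points, edges, k):
    """Cut the k-1 longest MST edges and label the resulting k components."""
    n = len(points)
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]   # path halving
            x = root[x]
        return x

    # An MST has n-1 edges, so keeping the n-k shortest removes the k-1 longest.
    for _, i, j in sorted(edges)[: n - k]:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

def silhouette(points, labels):
    """Mean silhouette coefficient (stand-in index, NOT BTCV); needs >= 2 clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue                  # a singleton contributes a score of 0
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        total += (b - a) / max(a, b)  # assumes no cluster of identical points
    return total / len(points)

def best_k(points, k_max):
    """Score every k in 2..k_max and return the best k with all scores."""
    edges = mst_edges(points)
    scores = {k: silhouette(points, mst_clusters(points, edges, k))
              for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three well-separated square blobs.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
k, scores = best_k(points, k_max=6)
print(k, {m: round(s, 3) for m, s in scores.items()})
```

Cutting the longest tree edges is what lets this family of methods follow clusters of irregular shape, and it is also why a single noise point can create a spurious long edge. An index computed on neighborhoods rather than on individual samples, as the abstract proposes, is aimed at that weakness.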
Keywords/Search Tags: Nearest neighborhood structure, Minimum spanning tree, Cluster analysis, Cluster validity index