Clustering Algorithm Of Missing Data Based On Dissimilarity Measure

Posted on:2022-01-23

Degree:Master

Type:Thesis

Country:China

Candidate:W W Chen

Full Text:PDF

GTID:2518306605971339

Subject:Control theory and control engineering

Abstract/Summary:

Clustering is a technique for grouping data that is widely used in image segmentation, financial analysis, and information retrieval. It divides data into clusters according to the similarity among data objects, so that the elements within each cluster are as similar as possible while the elements in different clusters are as different as possible. In practice, missing data is common because of system faults, measurement errors, electronic noise, and other causes, and most datasets are incomplete. Most clustering algorithms can only model and analyze complete datasets and cannot handle data with missing values. How to perform high-quality clustering analysis when the dataset contains missing values is therefore both a focus and a difficulty of current research.

This thesis studies the problem of clustering data with missing values and proposes two methods for clustering incomplete datasets. Compared with traditional methods for clustering data with missing values, the proposed algorithms significantly improve clustering performance. The main work is as follows:

1. An adaptive mean imputation algorithm is proposed to solve the problem of homogeneous fill values. The method determines the adjustment direction from the dissimilarity between the observable features of a sample and the average level of the dataset, and it uses an adjustment coefficient and the standard deviation of the observable features as adjustment terms to correct the mean imputation. The adaptive mean imputation value avoids homogenized imputation, so the imputed dataset retains a degree of diversity. The experiments evaluate the algorithm from two perspectives: imputation effectiveness and clustering performance. The results show that adaptive mean imputation outperforms mean imputation: the root mean square error of imputation is reduced by 46.3%, and the clustering effectiveness on the imputed datasets is improved by 16.9%.

2. Because clustering algorithms cannot cluster incomplete datasets directly, a dissimilarity measure is proposed. The dissimilarity measure evaluates the difference between samples in a dataset with missing values. It corrects the Euclidean distance using penalty coefficients and the standard deviation of the observable features. The k-means clustering algorithm is improved with this dissimilarity measure, so that it can cluster incomplete datasets directly, which expands the application scenarios of k-means. The results show that the k-means clustering algorithm based on the dissimilarity measure outperforms the traditional methods.

3. The impact of the missing-data mechanism on clustering performance is studied. The thesis introduces the types of clustering algorithms and the methods of validity evaluation, and it explains the missing-data mechanisms. Experiments on datasets verify the effect of the missing-data mechanism. The results show that the mechanism has a significant impact on the analysis of the dataset: at the same missing rate, the clustering results under different missing mechanisms differ by up to 50%. As the missing rate increases, the imputation error grows and the accuracy of clustering analysis decreases significantly.
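
The abstract does not give the formula of the dissimilarity measure, so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's actual method. It assumes that missing values are encoded as None, that a feature pair with a missing value contributes a term equal to a penalty coefficient times that feature's standard deviation over its observed values, and that fully observed pairs contribute the usual squared difference. The names feature_stds, dissimilarity, and penalty are hypothetical.

```python
import math
from statistics import pstdev


def feature_stds(data):
    """Standard deviation of each feature over its observed (non-None) values."""
    n_features = len(data[0])
    stds = []
    for j in range(n_features):
        observed = [row[j] for row in data if row[j] is not None]
        # With fewer than two observed values the spread is undefined; use 0.
        stds.append(pstdev(observed) if len(observed) > 1 else 0.0)
    return stds


def dissimilarity(x, y, stds, penalty=1.0):
    """Euclidean-style distance in which unobserved feature pairs add a penalty.

    Assumption (not taken from the thesis): when either value is missing,
    the unknown difference is replaced by penalty * std of that feature.
    """
    total = 0.0
    for xj, yj, sj in zip(x, y, stds):
        if xj is None or yj is None:
            total += (penalty * sj) ** 2
        else:
            total += (xj - yj) ** 2
    return math.sqrt(total)


# Toy dataset with one missing value in the second sample.
data = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
stds = feature_stds(data)
print(dissimilarity(data[0], data[1], stds, penalty=1.0))
```

With penalty set to 0 the function reduces to a partial distance over the jointly observed features, and on complete samples it equals the ordinary Euclidean distance. A k-means variant built on such a measure would also need a centroid update that tolerates missing values, for example averaging only the observed values of each feature; that choice is likewise an assumption here.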

Keywords/Search Tags:

Clustering algorithms, Missing values, k-means clustering, Mean imputation, Dissimilarity measure

Related items

1	The Online Imputation Method Of Missing Value Based On KNN And Its Application In Credit Evaluation
2	Studies On Missing Data Imputation
3	A Novel Missing Data Imputation Method Based On K-means Algorithm And Association Rules
4	Comparative Study On Imputation Methods Of Missing Data In XGBOOST Model Under Complete Random Missing Mechanism
5	Research On Data Cleaning Based On Clustering
6	Nonparametric Imputation For Missing Data
7	The Research On Clustering Algorithm For Categorical Data Using Quantum Mechanics
8	The Analysis And Improvement Research Of Knn-imputation Algorithm
9	Attribute Associated Neuron Modeling And Missing Value Imputation Based On Neural Network
10	Research On Missing Value Imputation Of Incomplete Data