
Clustering Algorithm Of Missing Data Based On Dissimilarity Measure

Posted on: 2022-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: W W Chen
Full Text: PDF
GTID: 2518306605971339
Subject: Control theory and control engineering
Abstract/Summary:
Clustering is a technique for grouping data that is widely used in image segmentation, financial analysis, and information retrieval. It divides data into clusters according to the similarity among data objects, so that the elements within each cluster are as similar as possible while the elements in different clusters are as different as possible. In practice, missing data are common because of system faults, measurement errors, electronic noise, and other causes, and most datasets are incomplete. Most clustering algorithms can only model and analyze complete datasets and cannot handle missing values, so performing high-quality cluster analysis on datasets with missing values is both an important and a difficult problem. This thesis studies the clustering of data with missing values and proposes two methods for clustering incomplete datasets. Compared with traditional methods for clustering data with missing values, the proposed algorithms significantly improve clustering performance. The main work is as follows:

1. An adaptive mean imputation algorithm is proposed to address the problem of homogeneous fill values. The method determines the direction of adjustment from the dissimilarity between a sample's observable features and the average level of the dataset, and corrects the mean imputation with an adjustment term formed from an adjustment coefficient and the standard deviation of the observable features. The adaptive fill values avoid homogenized imputation, so the imputed dataset retains some diversity. The experiments evaluate the algorithm from two perspectives: imputation effectiveness and clustering performance. The results show that adaptive mean imputation outperforms plain mean imputation: the root mean square error of imputation is reduced by 46.3%, and the clustering effectiveness on the imputed datasets is improved by 16.9%.

2. Because clustering algorithms cannot cluster incomplete datasets directly, a dissimilarity measure is proposed. The dissimilarity measure evaluates the difference between samples in a dataset with missing values by correcting the Euclidean distance using a penalty coefficient and the standard deviation of the observable features. The k-means clustering algorithm is improved with this measure so that it can cluster incomplete datasets directly, which expands the scenarios in which k-means can be applied. The results show that the k-means clustering algorithm based on the dissimilarity measure outperforms traditional methods.

3. The impact of the missing-data mechanism on clustering performance is studied. The thesis introduces the types of clustering algorithms and the methods of validity evaluation, and explains the mechanisms by which values go missing. Experiments on datasets verify the effect of the missing-data mechanism. The results show that the mechanism has a significant impact on the analysis of a dataset: at the same missing rate, the clustering results under different missing mechanisms differ by up to 50%. As the missing rate increases, the imputation error grows and the accuracy of cluster analysis decreases significantly.
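The abstract does not give the formulas for the two proposed methods, so the following Python sketch is only an illustration of the general ideas in items 1 and 2, not the thesis's algorithms. It assumes missing values are stored as `None`, that every feature has at least one observed value, that the adjustment direction is the sign of a sample's average standardized deviation over its observed features, and that each feature missing from either sample contributes a penalty proportional to that feature's standard deviation. The names `column_stats`, `adaptive_mean_impute`, `dissimilarity`, `coeff`, and `penalty` are hypothetical.

```python
import math


def column_stats(rows):
    """Per-feature mean and population standard deviation over observed values."""
    means, stds = [], []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)  # assumes at least one observed value
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


def adaptive_mean_impute(rows, coeff=0.5):
    """Fill each missing value with mean + direction * coeff * std.

    Assumption: direction is the sign of the sample's average standardized
    deviation from the feature means, computed over its observed features.
    """
    means, stds = column_stats(rows)
    filled = []
    for r in rows:
        z = [(v - means[j]) / stds[j]
             for j, v in enumerate(r) if v is not None and stds[j] > 0]
        tendency = sum(z) / len(z) if z else 0.0
        direction = (tendency > 0) - (tendency < 0)  # -1, 0, or +1
        filled.append([
            v if v is not None else means[j] + direction * coeff * stds[j]
            for j, v in enumerate(r)
        ])
    return filled


def dissimilarity(a, b, stds, penalty=1.0):
    """Euclidean-style distance that tolerates missing values.

    Assumption: a feature missing in either sample contributes
    (penalty * std_j)^2 in place of the squared difference.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            total += (penalty * stds[j]) ** 2
        else:
            total += (x - y) ** 2
    return math.sqrt(total)
```

A function like `dissimilarity` could stand in for the Euclidean distance in the assignment step of k-means, which is how a clustering algorithm can be run on incomplete data without a separate imputation pass. The centroid update would then need to average only the observed values of each feature.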
Keywords/Search Tags:Clustering algorithms, Missing value, k means cluster, Mean imputation, Dissimilarity measure