Determination Of Optimal Clustering Number Of Mixed Data And Its Application

Posted on: 2020-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: X L Li
Full Text: PDF
GTID: 2428330575952163
Subject: statistics
Abstract/Summary:
With the rise of "big data", data mining has become a widely used term for the overall process of transforming large, mixed data sets into information. Cluster analysis is one of its important research directions. Clustering algorithms are the main tool of cluster analysis, and the number of clusters is often the key factor in their performance, because most clustering algorithms require the number of clusters to be specified in advance. Determining the optimal number of clusters is therefore an important step in data mining. As research problems grow more complex, mixed data (data with both numerical and categorical attributes) are increasingly the object of analysis, yet clustering research on mixed data remains limited. Building on the established work on cluster validity, cluster validity for mixed data can be used to determine the optimal number of clusters, which makes the topic both relevant and applicable today.

This thesis first analyzes the importance of cluster validity for mixed-attribute data in the context of big data and then studies the cluster validity problem in depth. The existing clustering algorithm is then improved to raise its efficiency. Finally, the DSKP algorithm for mixed-attribute data is proposed, its advantages and characteristics are summarized, and possible future directions for cluster validity research are outlined.

In improving the clustering algorithm, the following innovative work has been done: (1) Features of the data attributes are used to reduce the randomness of initial cluster center selection, and a simple random sampling method is proposed for large-scale data, in order to reduce the impact of outliers on the clustering results and improve the efficiency of the algorithm. (2) An improved weight for categorical data is proposed, based on principal component analysis of mixed data. By the principle of this method, multivariate data can be condensed into a few composite factors. The AFDM function in R is applied to the mixed-attribute data set after noise removal, and the weight of the categorical data is calculated from the ratio of the categorical data to the composite factors. (3) Drawing on the strength of D-S evidence theory in handling uncertainty, the improved clustering algorithm is combined with the traditional version and the evolutionary version of D-S evidence theory, yielding two versions of the DSKP algorithm. The superiority and generality of the proposed method for the cluster validity problem on mixed data are verified on an example.
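For readers unfamiliar with Dempster-Shafer (D-S) evidence theory, listed in the keywords, the following is a minimal Python sketch of Dempster's classical rule of combination applied to the choice of a cluster number. It is not the thesis's DSKP algorithm. The function name `combine`, the two "validity index" mass functions, and all numeric masses are hypothetical values chosen only for illustration, under the assumption that each source of evidence assigns belief mass to candidate cluster counts.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass,
    and the masses in each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the shared hypotheses.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint hypotheses contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Normalize by 1 - K so the combined masses sum to 1 again.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


# Hypothetical frame of discernment: candidate cluster numbers 2, 3, 4.
THETA = frozenset({2, 3, 4})

# Hypothetical masses from two sources of evidence (for example, two
# cluster validity indices); mass on THETA expresses "don't know".
m_index_a = {frozenset({2}): 0.2, frozenset({3}): 0.6, THETA: 0.2}
m_index_b = {frozenset({3}): 0.5, frozenset({4}): 0.3, THETA: 0.2}

fused = combine(m_index_a, m_index_b)

# Pick the single cluster number with the largest combined mass.
best_k = max((h for h in fused if len(h) == 1), key=fused.get)
print(sorted(best_k), round(fused[best_k], 4))
```

With these illustrative inputs, both sources lean toward three clusters, so the combined mass concentrates on that hypothesis while the conflicting mass is normalized away. How the thesis derives its mass functions, and how its "evolutionary version" differs from the classical rule, is not stated in the abstract.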
Keywords/Search Tags:Clustering algorithm, K-prototypes, Clustering validity, D-S evidence theory, Data mining