
The Optimization Research On K-Modes Clustering Algorithm

Posted on: 2020-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: B Jia
Full Text: PDF
GTID: 2428330623456673
Subject: Computer Science and Technology
Abstract/Summary:
Clustering algorithms divide a sample set into multiple groups in order to discover meaningful structure among the samples. As an efficient data analysis tool, clustering has therefore long been one of the most actively studied techniques among researchers in China and abroad. The K-Modes clustering algorithm proposed by Huang extends the K-Means clustering algorithm with an attribute-matching dissimilarity measure, so that it can perform cluster analysis on unordered categorical data. The algorithm uses simple 0-1 matching as its measure of dissimilarity. This measure weakens the similarity of attribute values within a class and ignores the differences between attributes. A mode that holds a single value per attribute neglects the possibility that an attribute may take several values within a cluster, and the algorithm is strongly affected by the choice of initial center points. All of these problems can lower the accuracy of the clustering results.

In addition, with the explosive growth of data, traditional serially executed algorithms struggle to process very large and very high-dimensional data sets in acceptable time. As a recent big data platform, Spark is well suited to massive data analysis tasks. However, Spark's existing machine learning library contains no clustering algorithm for categorical data, which makes it difficult to use Spark effectively for clustering massive categorical data sets.

To address these problems, this dissertation first proposes the MAV-K-Modes clustering algorithm, which uses a multi-attribute-value mode initialization method based on pre-clustering and a dissimilarity measure based on multi-attribute-value modes. Second, it parallelizes the improved MAV-K-Modes algorithm on Spark and proposes parallelization schemes suited to static datasets and to incremental datasets, respectively. The main research contents of this dissertation are as follows:

(1) To improve accuracy, this dissertation proposes the MAV-K-Modes clustering algorithm, based on multi-attribute-value modes, to improve the clustering of unordered categorical data sets. The algorithm uses a multi-attribute-value mode initialization method based on pre-clustering, which reduces the influence of local optima. Its dissimilarity measure, based on multi-attribute-value modes, addresses the shortcomings of the simple 0-1 matching used by the traditional K-Modes algorithm: it effectively prevents the loss of important attribute values during clustering and strengthens the similarity between attribute values within the same class. Information entropy is used to calculate the weight of each attribute, which strengthens the differences between attributes.

(2) To improve execution efficiency, this dissertation proposes, based on Spark, a parallel MAV-K-Modes clustering algorithm for static datasets and an incremental MAV-K-Modes clustering algorithm for incremental datasets. The parallel algorithm effectively improves clustering efficiency on massive categorical data without affecting the accuracy of the results. The incremental algorithm effectively improves clustering efficiency on incremental datasets at the cost of a slight reduction in the accuracy of the clustering results.
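
For readers unfamiliar with the baseline, the following Python sketch illustrates the simple 0-1 matching dissimilarity and the single-value mode used by traditional K-Modes, together with one possible entropy-based attribute weighting. The abstract does not give the dissertation's formulas, so the weighting shown here (weights proportional to each attribute's Shannon entropy) is an assumption made for illustration only, and all function names are hypothetical. It is not the MAV-K-Modes algorithm itself.

```python
from collections import Counter
from math import log


def simple_matching(x, mode):
    """0-1 simple matching: number of attributes on which x and mode differ."""
    return sum(1 for a, b in zip(x, mode) if a != b)


def cluster_mode(rows):
    """Single-value mode: the most frequent value of each attribute."""
    n_attrs = len(rows[0])
    return tuple(
        Counter(r[j] for r in rows).most_common(1)[0][0]
        for j in range(n_attrs)
    )


def entropy_weights(rows):
    """ASSUMPTION (not from the dissertation): weight each attribute by its
    Shannon entropy, normalized so that the weights sum to 1."""
    n = len(rows)
    n_attrs = len(rows[0])
    entropies = []
    for j in range(n_attrs):
        counts = Counter(r[j] for r in rows)
        h = -sum((c / n) * log(c / n) for c in counts.values())
        entropies.append(h)
    total = sum(entropies)
    if total == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / n_attrs] * n_attrs
    return [h / total for h in entropies]


def weighted_matching(x, mode, weights):
    """Weighted 0-1 matching: sum the weights of the mismatching attributes."""
    return sum(w for a, b, w in zip(x, mode, weights) if a != b)


# Toy categorical data: (colour, size, shape). The shape attribute is constant.
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
    ("red", "small", "round"),
]
mode = cluster_mode(rows)
weights = entropy_weights(rows)
```

In this toy data the constant attribute receives zero weight, so it no longer contributes to the dissimilarity. This is one way attribute differences could be taken into account, and the dissertation's actual weighting may differ.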
Keywords/Search Tags: Unordered Categorical Data, Clustering Algorithm, K-Modes, Big Data, Spark