Font Size: a A A

Research On Semi-supervised Size Constrained Clustering

Posted on:2021-02-28Degree:MasterType:Thesis
Country:ChinaCandidate:W TangFull Text:PDF
GTID:2428330623479537Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Clustering is one of the essential unsupervised learning tools for data mining since it reveals the natural structures of the unlabeled data.However,traditional clustering algorithms are not designed to solve the semi-supervised problems with prior size constraints that exist in many real applications.Most of the size constrained clustering methods focus on soft constraints.Little attention is drawn to hard constraints.Moreover,there is still much room for improvement in the accuracy and time complexity of existing methods.At the same time,the researchers partly ignored the optimal modeling of the size constrained clustering.In this thesis,we propose two novel hard size constrained clustering methods based on Integer Linear Programming?ILP?.Firstly,we propose a balanced clustering algorithm.The task is to partition observations into clusters so that the sizes of all the clusters are approximately the same?±1?.We optimize Mean Squared Error?MSE?based on ILP to optimal model the clustering problem with balanced size constraints.For the proposed model,we propose an iterative method to solve it.Each iteration can be split into two steps,namely an assignment step and an update step.In the assignment step,the data are evenly assigned to each cluster.The balanced assignment task here is formulated as an ILP,and we prove that the constraint matrix of this ILP is totally unimodular.Thus,the ILP is relaxed as a Linear Programming?LP?which can be efficiently solved with the simplex algorithm.In the update step,the new centers are updated as the means of the observations in the clusters.We conduct several experiments on the synthetic and UCI data sets.Assuming that there are9)observations and the algorithm needs8)iterations to converge,we show that the average time complexity of the proposed algorithm is?8?9)1.65)-?8?9)1.70).Experimental results indicate that compared with the state-of-the-art balanced clustering methods,the proposed algorithm is efficient in deriving more accurate clustering.Secondly,based on the proposed method for balanced size constrained clustering,we extend it to size constrained clustering,which means that each size of the cluster can be assigned by users.Given an unordered set of cluster size constraints,the key is to assign observations into clusters while simultaneously considers the optimal size constraints.Similarly,we model this problem as an ILP problem that optimizes MSE and use an iterative method to solve it.Each iteration of the proposed method consists of two steps,namely an assignment step and an update step.In the assignment step,the task is still be represented as an ILP problem.We use the Observation Partition Decision Variables?OPDVs?to indicate the relations between the observations and the clusters.Furthermore,we introduce the Cluster Size Decision Variables?CSDVs?.We prove that the integer constraints on the OPDVs is totally unimodular.Therefore,only the integer constraints on CSDVs should be kept so that the problem would become a Mixed Integer Linear Programming?MILP?problem which is much easier to solve.In the update step,the behavior is the same as it of the balanced size constrained clustering.Experiments on UCI data sets indicate that?1?imposing the size constraints as proposed could improve the clustering performance;?2?compared with the state-of-the-art size constrained clustering methods,the proposed method could efficiently derive better clustering results.Lastly,to evaluate the proposed methods in real applications,we develop a prototyping system.Through such the system,we can observe that the proposed methods not only have the theoretical significance but also have the value of applications.
Keywords/Search Tags:constrained clustering, balanced size constraints, size constraints, mean squared error, linear program
PDF Full Text Request
Related items