
Randomized Algorithms For Multiple Center Clustering Problems

Posted on: 2011-01-24    Degree: Doctor    Type: Dissertation
Country: China    Candidate: S Q Wang    Full Text: PDF
GTID: 1118360302499812    Subject: Computer software and theory
Abstract/Summary:
Both the k-means problem and the k-median problem are typical computational models in cluster analysis, and they arise from a similar background. The k-means clustering problem is the following: given a point set P, compute a set of k centers such that the sum of the squared distances from the points of P to their respective nearest centers is minimized. The k-median problem differs only in that it minimizes the sum of distances instead of the sum of squared distances. This thesis studies randomized algorithms for the k-means problem in Euclidean space and the k-median problem in metric space.

The k-means problem in Euclidean space can be described as follows: given a set P of points in Euclidean space, compute a set C of k centers and a partition of P into k subsets P1, …, Pk such that Σx∈P d(x, C)² is minimized, where d(x, C) denotes the distance from x to the closest center in C. The k-median problem in metric space can be described as follows: given a set F of facilities and a set C of clients, where d(fi, cj) ∈ Z⁺ denotes the cost of serving client cj by facility fi, the goal is to select a subset S ⊆ F of at most k facilities and to serve all clients by facilities in S such that the total service cost is minimized.

Drineas et al. proved that the k-means problem is NP-hard in Euclidean space even for k = 2, and the k-median problem was likewise proved NP-hard by Hakimi. In both problems, if the size of each subset Pi (or Ci) is required to be at least some given value θ (≤ n/k), i.e. |Pi| ≥ θ or |Ci| ≥ θ (1 ≤ i ≤ k), we call such a partition balanced and the parameter θ the balanced-constraint condition. This thesis studies randomized algorithms and their practical variants for these two problems. The main results are as follows. First, an improved version of the randomized (1+ε)-approximation algorithm of Ostrovsky et al. for the k-means problem is given.
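For concreteness, the two objective functions defined above can be written down directly. A minimal Python sketch (the toy points and center choices are illustrative, not from the thesis):

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)

def kmedian_cost(points, centers):
    """k-median objective: sum of distances to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
centers = [(0.5, 0.0), (10.5, 0.0)]
print(kmeans_cost(points, centers))   # → 1.0 (four points at distance 0.5)
print(kmedian_cost(points, centers))  # → 2.0
```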
Second, using the value of the balanced-constraint condition, a reverse greedy randomized algorithm for the k-median problem is proposed. Third, based on the local search heuristic, a randomized algorithm for the k-means problem is presented. Fourth, an improvement of Lloyd's heuristic algorithm is presented. The specific research contributions are as follows.

(1) Ostrovsky et al. proposed a randomized (1+ε)-approximation algorithm for the k-means problem with running time O(2^O(k(1+α²)/ε) · nd), where n and α (< 1) denote the size of the given point set and the separation coefficient, respectively. An improved randomized algorithm with running time O(2^O(kα²/ε) · nd) is proposed in this thesis. Given the point set P, let P1*, …, Pk* denote the k optimal partition subsets. The main process of the previous algorithm is as follows. For each subset Pi*, two subsets Bi and Ri with Bi ⊆ Pi* ⊆ Ri and |Pi*| ≥ β|Ri| (0 < β ≤ 1) are computed. Then, for each i, 4/(βε) points are picked independently and uniformly at random from Ri, where ε is a given input parameter; let Si denote this set of sample points. For each i, a subset of 2/ε points of Si is chosen and its centroid is used as the potential center of Pi*. By enumerating all possible k-tuples of centers drawn from S1, …, Sk, the k potential centers that yield the least-cost solution are returned as the final centers. The improvements proposed in this thesis are the following. First, when sampling 4/(βε) points from Ri, the sampling parameter β is enlarged: the previous algorithm chooses β = 1/(1+144α²), but it is proved here that β can be enlarged to 49/(49+3600α²) ≈ 1/(1+73.5α²), which reduces the number of sample points. Second, based on the relation Bi ⊆ Pi* ⊆ Ri, each sample set Si is split into two subsets: the first is guaranteed to be contained in Pi*, whereas for the points of the second it cannot be decided whether they belong to Pi*.
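The sample-then-enumerate idea underlying both the original and the improved algorithm can be sketched as follows; the function names, the toy two-cluster instance, and the small sample/subset sizes are illustrative assumptions, not the thesis's parameters:

```python
import itertools
import random

def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    dim = len(pts[0])
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(dim))

def candidate_centers(R, sample_size, subset_size, rng):
    """Sample points from R with replacement, then return the centroid of
    every subset of the given size as a potential cluster center."""
    S = [rng.choice(R) for _ in range(sample_size)]
    return [centroid(list(sub)) for sub in itertools.combinations(S, subset_size)]

def kmeans_cost(P, centers):
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in P)

def best_centers(P, candidate_lists):
    """Enumerate one candidate per cluster; keep the k-tuple of least cost."""
    return min(itertools.product(*candidate_lists),
               key=lambda centers: kmeans_cost(P, centers))

rng = random.Random(0)
left = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
right = [(p[0] + 10.0, p[1]) for p in left]
cands = [candidate_centers(left, 4, 2, rng), candidate_centers(right, 4, 2, rng)]
c1, c2 = best_centers(left + right, cands)
```

Here each `Ri` is already known, so the sketch only shows the sampling and enumeration steps; the algorithm's actual work lies in constructing the Bi and Ri and bounding the failure probability.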
Therefore, only a few points need to be enumerated from the second subset and merged with the first so that the merged subset has 2/ε points. Compared with the previous algorithm, the number of enumerations is clearly reduced. It is proved that the expected running time of the improved algorithm is O(2^O(kα²/ε) · nd). Finally, it is proved that the improved algorithm finds a (1+ε)-approximate solution with success probability (1/2(1−e^(−1/26)))^k, up to a lower-order factor; the previous work did not give an explicit expression for the probability of finding a (1+ε)-approximate solution.

(2) Using the value θ of the balanced constraint, a reverse greedy randomized (3 + O(ln ln k / α))-approximation algorithm for the k-median problem is proposed, where α (≤ 1) is the balanced-constraint parameter. The main process of the algorithm is as follows. First, a sample set S of size O((k/α) ln k) is drawn independently and uniformly from the client set C; it is proved that, with high probability, S contains at least one client from each optimal subset. For each client of S, the facility in F closest to that client is computed; let FS denote the set of these facilities, so clearly |FS| ≤ |S|. Ignoring the rest of F and taking the facility subset FS together with the client set C as the instance, the reverse greedy algorithm is run to select k facilities that serve all clients in C. It is proved that the expected approximation ratio of the new algorithm is 3 + O(ln ln k / α) with high probability. The running time of the algorithm is O(((k/α) ln k)² (n + m)), where n and m denote the numbers of clients and facilities of the given instance, respectively.
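The two phases of this algorithm can be sketched as follows, assuming a generic distance function; the helper names and the one-dimensional toy instance are illustrative:

```python
import random

def reverse_greedy(facility_ids, facilities, clients, k, dist):
    """Start with the given facilities open; repeatedly close the facility
    whose removal increases the total service cost least, until k remain."""
    open_f = set(facility_ids)
    def cost(F):
        return sum(min(dist(c, facilities[i]) for i in F) for c in clients)
    while len(open_f) > k:
        open_f.remove(min(open_f, key=lambda i: cost(open_f - {i})))
    return open_f

def sampled_facilities(facilities, clients, sample_size, dist, rng):
    """Sample clients uniformly; keep each sampled client's nearest facility."""
    S = rng.sample(range(len(clients)), sample_size)
    return {min(range(len(facilities)), key=lambda i: dist(clients[j], facilities[i]))
            for j in S}

facilities = [0.0, 1.0, 10.0, 11.0]
clients = [0.0, 0.2, 0.4, 10.0, 10.2, 10.4]
d = lambda c, f: abs(c - f)
rng = random.Random(1)
F_S = sampled_facilities(facilities, clients, 4, d, rng)     # phase 1
chosen = reverse_greedy(F_S, facilities, clients, 2, d)      # phase 2
```

The point of phase 1 is that reverse greedy then starts from the small set FS rather than all of F, which is what shrinks the running time relative to running reverse greedy on the full instance.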
In comparison, Chrobak's reverse greedy algorithm has an approximation ratio between Ω(log n / log log n) and O(log n) and running time O(n³), so the new algorithm performs better than the previous one.

(3) A randomized local search algorithm is proposed in this thesis. Mingjun Song presented a local search algorithm for k-means clustering in which the input points are used as the candidate centers. The main process of that algorithm is as follows: initially select an arbitrary set of k centers from the input points; then, for some integer p, swap any subset of p′ (≤ p) centers of the center set with p′ input points whenever the new centers decrease the cost, and repeat the swapping procedure until the cost no longer changes. If one point is chosen from each of the k optimal subsets and used as the potential center of that subset, it is proved that the expected cost is at most 2 times the optimal cost. Based on this conclusion, two randomization strategies for Song's algorithm are put forward. The first strategy picks the k points from the input set P according to a non-uniform probability distribution, so as to make them as likely as possible to contain one point from each of the k optimal subsets. The second strategy gives a new method for generating the candidate center set: using the balanced constraint, a sample set S is drawn uniformly from the given points; with high probability, S contains at least one point of each optimal subset. Using S instead of P as the candidate set clearly reduces both the size of the candidate set and the running time. Song's algorithm runs in time O(n²k³ log(n) d).
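A single-swap (p = 1) version of the swap heuristic can be sketched as follows; restricting `candidates` to a smaller sample set S is exactly the thesis's second strategy. The function name and toy instance are illustrative:

```python
import random

def kmeans_cost(P, centers):
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in P)

def swap_local_search(P, candidates, k, rng):
    """Start from k random candidate centers; swap one center for one
    candidate whenever the swap lowers the cost; stop at a local optimum."""
    centers = rng.sample(candidates, k)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in candidates:
                if cand in centers:
                    continue
                trial = centers[:i] + [cand] + centers[i + 1:]
                if kmeans_cost(P, trial) < kmeans_cost(P, centers):
                    centers, improved = trial, True
        # loop again until a full pass finds no improving swap
    return centers

P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
best = swap_local_search(P, P, k=2, rng=random.Random(0))
```

Each accepted swap strictly decreases the cost and there are finitely many center sets, so the search terminates; on this well-separated instance every local optimum places one center in each cluster.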
It is proved that the new algorithm's running time is O(nk³ d log(n) log(k)/α), where n and α (< 1) denote the size of the given point set and the balanced-constraint parameter, respectively. Compared with Song's algorithm, the experimental results show that the new algorithm performs better in both running time and solution quality.

(4) Lloyd's algorithm is one of the most widely used algorithms for k-means clustering. Initially, k centers are chosen at random from the input points. Each input point then identifies its nearest center, and points that choose the same center form one cluster; new centers are computed for the clusters, each input point again identifies its nearest center, and so on, until no changes occur. In this thesis an improved version of Lloyd's algorithm is presented. Specifically, the k initial centers are selected by the method of the randomized local search algorithm proposed above, and the partition of the point set into k subsets uses the method proposed by Elkan, which applies the triangle inequality to avoid unnecessary distance computations. The experimental results show that the improved algorithm outperforms Lloyd's algorithm in both solution quality and running time.
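For reference, the plain Lloyd iteration (without the randomized initialization or Elkan's triangle-inequality pruning described above) can be sketched as:

```python
def lloyd(P, centers, max_iters=100):
    """Plain Lloyd iteration: assign each point to its nearest center,
    move each center to the centroid of its cluster, repeat to convergence."""
    dim = len(P[0])
    for _ in range(max_iters):
        clusters = [[] for _ in centers]
        for p in P:
            nearest = min(range(len(centers)),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(q[d] for q in cl) / len(cl) for d in range(dim))
                       if cl else centers[i]     # keep an empty cluster's center
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:               # no change: converged
            break
        centers = new_centers
    return centers

P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
final = lloyd(P, [(0.0, 0.0), (10.0, 0.0)])     # → [(0.5, 0.0), (10.5, 0.0)]
```

The thesis's improvement replaces the random choice of the initial `centers` argument and speeds up the assignment step; the iteration structure itself is unchanged.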
Keywords/Search Tags:Randomized Algorithm, Approximate Algorithm, k-means, k-median, Computational Complexity