The K-means algorithm is one of the most popular clustering algorithms and is widely used in computer vision, text mining, customer analysis, and other fields. K-means is simple and efficient, but it suffers from two main problems: it is sensitive to the initial cluster centers, and it requires the user to specify the value of K in advance. The agglomerative fuzzy K-means algorithm is not sensitive to the initial cluster centers and can find the real number of clusters through an agglomerative procedure. However, it is costly in time, because it takes many iterations to find the best K. In this thesis, we first propose an enhanced algorithm based on agglomerative fuzzy K-means. In the enhanced algorithm, we replace the random initial value selection used in agglomerative fuzzy K-means with a new initial center selection method to reduce the algorithm's time cost. We also present a MapReduce implementation of the enhanced algorithm to improve its ability to handle large-scale datasets. We then study the methods and problems of clustering micro-blog users. We introduce a topic-model-based method to obtain user vectors: the topic model is trained on Chinese Wikipedia and then applied to micro-blog data. Finally, we apply the enhanced agglomerative fuzzy K-means algorithm to micro-blog user clustering. Experimental results show that the new algorithm can reduce the time cost. Analysis of the Weibo user clustering results shows that the algorithm produces suitable clusters.