
Optimizing the K-Means Algorithm for Text Clustering

Posted on: 2008-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: K Yu
GTID: 2178360218956623
Subject: Computer application technology
Abstract/Summary:
With the spread of the Internet and the growing computerization of enterprises, unstructured text data such as HTML pages and free-text files, and semi-structured text data such as XML, are growing at an astonishing rate. Managing and analyzing text data has therefore become very important. Text clustering is one of the fundamental functions of text mining. Clustering divides a collection of text documents into groups such that the similarity between documents in the same group is as large as possible, while the similarity between documents in different groups is as small as possible.

Since the 1950s, many clustering algorithms have been proposed. They can be divided roughly into two kinds: partition-based and hierarchical. Among partition-based clustering algorithms, the most famous is the K-Means type. Since MacQueen first published it in 1967, K-Means has become one of the prevalent clustering algorithms in mathematical statistics, pattern recognition, machine learning, data mining and other fields, and many derivative algorithms have been developed from it, forming the K-Means algorithm family. Owing to their speed and simplicity, K-Means type algorithms are suitable for clustering many kinds of data, such as text and image features.

However, because the initial centers are selected at random, traditional K-Means and its variants often produce unstable results. This paper sorts the points by density: it adaptively selects an optimized density radius to determine the maximum point density, and then selects points whose density is high and which are otherwise reasonable choices as the initial centers. This optimizes the choice of centers and gives the K-Means algorithm a good start.
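The density-based choice of initial centers described above can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything specific here is an assumption made for illustration: density is taken to be the number of neighbors within a radius, the default radius is half the mean pairwise distance, and a "reasonable" candidate is read as one lying farther than the radius from every center already chosen. The function name `density_initial_centers` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_initial_centers(points, k, radius=None):
    """Pick up to k dense, well-separated points as initial K-Means centers.

    Illustrative sketch only; the rules below are assumptions, not the
    thesis's method. Requires at least two points.
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]

    if radius is None:
        # Assumption: half the mean pairwise distance stands in for the
        # adaptively selected density radius.
        total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
        radius = 0.5 * total / (n * (n - 1) / 2)

    # Density of a point = number of other points within the radius.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
        for i in range(n)
    ]

    # Visit points from densest to sparsest (ties keep input order).
    order = sorted(range(n), key=lambda i: -density[i])

    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        # Assumption: skip candidates that lie inside the radius of a
        # center that has already been chosen.
        if all(dist[i][c] > radius for c in chosen):
            chosen.append(i)

    # Fallback: if too few separated points exist, fill by density order.
    for i in order:
        if len(chosen) == k:
            break
        if i not in chosen:
            chosen.append(i)

    return [points[i] for i in chosen]
```

Because the selection is deterministic for a given data set and radius, repeated runs start K-Means from the same centers, which is the stability property the abstract is after.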
Meanwhile, because the text feature matrix is high-dimensional and sparse, each category is confined to a subset of the keywords during clustering. This paper therefore assigns the variables a different weight in each cluster according to their contribution to the clustering, with more important variables receiving larger weights. This effectively addresses the high dimensionality and sparseness of text data, markedly improves the accuracy of the K-Means algorithm, finds good clusters quickly, and yields an optimized algorithm suited to text data. With these two improvements to the K-Means algorithm, the experimental results show that the optimized algorithm produces high-quality and stable clustering results. In addition, to make the clustering results easy to express, we select several important keywords to represent each cluster, so that the content of the clusters can be understood correctly and the performance and efficiency of information processing can be improved.
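The idea of giving each variable its own weight within each cluster can be sketched as follows. The abstract does not state the weight formula, so this example swaps in a common entropy-style update as an assumption: a variable's weight in a cluster decreases exponentially with its within-cluster dispersion, and the weights of a cluster sum to one. The names `cluster_feature_weights`, `weighted_sq_distance` and the parameter `gamma` are hypothetical.

```python
import math


def cluster_feature_weights(members, center, gamma=1.0):
    """Per-variable weights for one cluster (illustrative assumption).

    Variables on which the cluster's members agree closely (low dispersion
    around the center) receive larger weights. gamma > 0 controls how
    sharply the weights concentrate on those variables.
    """
    m = len(center)
    dispersion = [
        sum((x[j] - center[j]) ** 2 for x in members) for j in range(m)
    ]
    # Subtract the smallest dispersion before exponentiating so that the
    # largest raw value is exactly 1 and nothing underflows to all zeros.
    lowest = min(dispersion)
    raw = [math.exp(-(d - lowest) / gamma) for d in dispersion]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_sq_distance(x, center, weights):
    """Squared distance in which each variable is scaled by its weight."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, center))
```

In a full algorithm, the assignment step would use `weighted_sq_distance` with each cluster's own weights, and the weights would be recomputed after every center update. For sparse text vectors, this lets each cluster be described mainly by the keywords on which its documents agree.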
Keywords/Search Tags: clustering, K-Means, density radius, self-weighting, index