Optimize K-Means Algorithm To Apply In Text Clustering

Posted on:2008-05-10

Degree:Master

Type:Thesis

Country:China

Candidate:K Yu

Full Text:PDF

GTID:2178360218956623

Subject:Computer application technology

Abstract/Summary:

With the spread of the Internet and the growth of enterprise informatization, unstructured text data (such as HTML pages and free-text files) and semi-structured text data (such as XML) are increasing at an astonishing rate, so managing and analyzing text data has become very important. Text clustering is one of the fundamental functions of text mining. It divides a collection of text documents into groups so that documents in the same group are as similar as possible while documents in different groups are as dissimilar as possible. Since the 1950s, many clustering algorithms have been proposed. They can be roughly divided into two kinds: partition-based and hierarchical. Among the partition-based clustering algorithms, the most famous is K-Means. Since MacQueen first published it in 1967, it has become one of the most prevalent clustering algorithms in mathematical statistics, pattern recognition, machine learning, and data mining, and many derivative algorithms have been developed from it, forming the K-Means algorithm family. Because they are fast and simple, K-Means type algorithms are suitable for cluster analysis of many kinds of data, such as text and image features. However, because the initial centers are selected at random, traditional K-Means and its variants often produce unstable results. This paper sorts the points by density, determines the maximum point density by adaptively selecting an optimized density radius, and takes points whose density is both high and reasonable as the initial centers. This optimizes the choice of initial centers and gives the K-Means algorithm a good start.
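
The density-based choice of initial centers described above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything specific here is an assumption: the function name `density_initial_centers` is hypothetical, the density of a point is taken to be the number of neighbors within the radius, the default radius is half the mean pairwise distance (a stand-in for the adaptive radius selection), and a "reasonable" center is taken to be one lying more than one radius away from every center already chosen.

```python
import math


def density_initial_centers(points, k, radius=None):
    """Pick up to k initial centers from dense, mutually distant points.

    Illustrative sketch only; not the thesis's exact method.
    """
    n = len(points)
    # Full pairwise Euclidean distance table (fine for small examples).
    dist = [[math.dist(p, q) for q in points] for p in points]

    if radius is None:
        # Assumption: half the mean pairwise distance stands in for the
        # adaptively selected density radius.
        total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
        radius = 0.5 * total / (n * (n - 1) / 2)

    # Density of a point = number of other points within the radius.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
        for i in range(n)
    ]

    # Visit points from densest to sparsest (ties keep input order).
    order = sorted(range(n), key=lambda i: density[i], reverse=True)

    chosen = []
    for i in order:
        # Accept a point only if it is far from every center chosen so far.
        if all(dist[i][c] > radius for c in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # May return fewer than k centers if the radius is too large.
    return [points[i] for i in chosen]


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
]
centers = density_initial_centers(points, k=2, radius=0.8)
```

With the radius fixed at 0.8, the middle point of each group has the most neighbors, so the two returned centers come from different groups. Because the selection is deterministic, repeated runs start K-Means from the same centers, which is the stability property the abstract is after.
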
Meanwhile, because the text feature matrix is high-dimensional and sparse, each category is confined to a subset of the keywords during clustering. This paper therefore assigns each variable a different weight in each cluster according to its contribution to the clustering, giving important variables larger weights. This effectively addresses the high dimensionality and sparseness of text data, markedly improves the accuracy of the K-Means algorithm, finds good clusters quickly, and yields an optimized algorithm suited to text data. The paper thus makes two important improvements to the K-Means algorithm, and the experimental results show that the optimized algorithm produces high-quality, stable clustering results. In addition, to make the clustering results easy to present, we select several important keywords to represent each cluster, so that the content of the clusters can be understood correctly and the performance and efficiency of information processing can be improved.
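
The per-cluster variable weighting can be sketched in the same spirit. The abstract does not state the weight formula, so this example swaps in an entropy-style update in which a variable's weight in a cluster shrinks exponentially with that variable's within-cluster dispersion. The function name `weighted_kmeans` and the parameter `gamma` are hypothetical, and the update should be read as one plausible choice, not as the thesis's own formula.

```python
import math


def weighted_kmeans(points, centers, gamma=1.0, iters=10):
    """K-Means with a separate weight per variable in each cluster.

    Illustrative sketch only; the weight update is an assumed
    entropy-style rule, not taken from the thesis.
    """
    k, d = len(centers), len(points[0])
    centers = [list(c) for c in centers]
    weights = [[1.0 / d] * d for _ in range(k)]  # start with equal weights
    labels = [0] * len(points)

    for _ in range(iters):
        # Assignment step: nearest center under that cluster's weights.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda l: sum(
                    weights[l][j] * (p[j] - centers[l][j]) ** 2
                    for j in range(d)
                ),
            )

        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if not members:
                continue  # keep the old center and weights for an empty cluster

            # Center update: per-variable mean of the cluster members.
            centers[l] = [sum(p[j] for p in members) / len(members) for j in range(d)]

            # Dispersion of each variable inside this cluster.
            disp = [
                sum((p[j] - centers[l][j]) ** 2 for p in members)
                for j in range(d)
            ]

            # Assumed weight update: low dispersion gives a high weight.
            # Subtracting the minimum keeps exp() numerically stable.
            low = min(disp)
            raw = [math.exp(-(dj - low) / gamma) for dj in disp]
            total = sum(raw)
            weights[l] = [r / total for r in raw]

    return labels, centers, weights


# Variable 0 separates the two groups; variable 1 is spread out in both.
data = [(0.0, 0.0), (0.1, 5.0), (0.0, 10.0), (5.0, 0.0), (5.1, 5.0), (5.0, 10.0)]
labels, final_centers, weights = weighted_kmeans(data, [(0.0, 5.0), (5.0, 5.0)])
```

In this toy data the first variable is tight within each cluster and the second is not, so each cluster ends up weighting the first variable far more heavily. For text, the analogous effect would be that each cluster concentrates its weight on its own subset of keywords.
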

Keywords/Search Tags:

clustering, K-Means, density radius, self-weighting, index

Related items

1	Research On Improved K-means Clustering Algorithm Based On Density
2	Research On Cluster Center Optimization Of K-means Algorithm
3	Improving Of Clustering Algorithm And Research On Clustering Validity Index
4	Multi-Improvement On Density-Based Clustering Algorithm And Its Applications
5	The Research And Implementation Of Density-based Clustering Algorithm With Pattern Evaluation Methods
6	Research And Implementation On Variable Weighting In K-means Type Clustering
7	Research On The Selection Of Initial Cluster Centers In K-means Algorithm
8	Research On KNN Algorithm Based On Clustering Of Training Set And Its Application
9	Research On Improvement Of K-means Clustering Algorithm
10	The Research And Application Of The Methods To Determine The Clustering Radius