Text Clustering Based On K-means Algorithm And Realization

Posted on: 2011-11-14 | Degree: Master | Type: Thesis
Country: China | Candidate: J Gao | Full Text: PDF
GTID: 2208360302970040 | Subject: Computer application technology
Abstract/Summary:
With the rapid development and wide adoption of the Internet, the amount of information people receive is growing exponentially. Text, as an important carrier of information, contains a large number of valuable resources waiting to be mined and studied. However, the diversity and complexity of text information make it difficult to find interesting and relevant information. With the emergence and development of text mining, people can quickly and efficiently extract simple, concise, and understandable knowledge from large collections of text. Text clustering, an important branch of text mining, is attracting increasingly widespread attention. The K-means algorithm is widely used in text clustering because of its simplicity and speed. However, the traditional K-means algorithm depends heavily on its initial values: the parameter k must be given in advance and is usually determined by the user's knowledge and experience. In addition, its initial cluster centers are selected randomly, and this randomness often leads to unstable clustering results. In short, different k values and different initial cluster centers have a great impact on clustering quality and time efficiency.

This thesis presents a fairly comprehensive study of text mining and cluster analysis, which includes the following.

First, it examines in depth the relevant theory and key technologies of text mining, including Chinese word segmentation, dimensionality reduction, text representation, weight evaluation, and similarity calculation.

Second, after an in-depth study of the traditional K-means algorithm, it improves the algorithm in two ways to address these deficiencies. (1) A sector-segmentation method is proposed to determine the initial number of clusters k. According to how important each feature item is in describing a text, the texts that contain important features are selected as a sample. The sector-segmentation method is then used to cluster this sample initially, and the resulting number of clusters is taken as the value of k in the K-means algorithm. (2) Starting from the characteristics of text mining itself, the initial centers are determined in reverse from the distribution of the clusters. Following the principle that similarity should be low between cluster centers and high between a center and the text objects around it, the method searches for the most effective text objects to serve as initial cluster centers: the pairwise similarity between these centers is low, and each center is surrounded by highly similar objects whose number exceeds a certain threshold.

Finally, this thesis implements a simple text clustering system based on the K-means algorithm and uses it to verify experimentally the effectiveness of the improved algorithm. The results show that the improved K-means algorithm resolves the instability of clustering results caused by random initialization, and that its time complexity is also reduced.
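The initial-center principle in improvement (2) can be illustrated with a minimal Python sketch. This is not the thesis's actual procedure, which the abstract does not spell out. The function names `cosine` and `pick_initial_centers`, the thresholds `near`, `far`, and `min_neighbors`, and the greedy selection order are all assumptions made for illustration; documents are assumed to be already represented as equal-length term-weight vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_initial_centers(docs, k, near=0.8, far=0.5, min_neighbors=1):
    """Return indices of up to k documents to use as initial centers.

    Hypothetical parameters (not taken from the thesis):
      near          -- similarity at or above which a document counts
                       as a neighbor of a candidate center
      far           -- maximum similarity allowed between two centers
      min_neighbors -- minimum neighbor count for a candidate center
    """
    n = len(docs)
    sim = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]

    # Count, for every document, how many other documents are highly similar.
    density = [
        sum(1 for j in range(n) if j != i and sim[i][j] >= near)
        for i in range(n)
    ]

    # Keep documents with enough neighbors, densest first (ties by index).
    candidates = sorted(
        (i for i in range(n) if density[i] >= min_neighbors),
        key=lambda i: (-density[i], i),
    )

    # Greedily accept candidates that are dissimilar to every chosen center.
    centers = []
    for i in candidates:
        if all(sim[i][c] <= far for c in centers):
            centers.append(i)
        if len(centers) == k:
            break
    return centers


# Two well-separated groups of three documents each.
docs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.05, 0.0],
    [0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [0.0, 0.95, 0.05],
]
print(pick_initial_centers(docs, k=2))  # [0, 3]
```

Because the selection is deterministic for a given document set and thresholds, repeated runs start K-means from the same centers, which is the kind of stability the abstract attributes to the improved algorithm. The function may return fewer than k indices when too few candidates satisfy both thresholds.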
Keywords/Search Tags: text mining, clustering, K-means, improved K-means, sector-segmentation method