
Research on a Text Clustering Algorithm Based on K-Means

Posted on: 2015-11-02    Degree: Master    Type: Thesis
Country: China    Candidate: Y Zhou    Full Text: PDF
GTID: 2208330431476714    Subject: Computer application technology
Abstract/Summary:
The amount of text information on the Internet is growing rapidly, and people spend more and more time searching and sorting it. Against this background, text clustering is valuable: by automatically grouping texts and extracting their leading features for retrieval, it significantly reduces manual work and improves query efficiency.

The experimental data are Chinese texts collected from the Internet by a web crawler. This paper proposes a text clustering algorithm based on K-Means, developed by analyzing the differences between numerical data and text data, different measures, and the evaluation of clustering results. The algorithm improves three aspects of K-Means: selection of the initial cluster centers, determination of the K value, and extraction of keywords from text clusters.

First, the Maximum-Minimum principle replaces random selection of the initial cluster centers. Experimental results show that this method effectively improves clustering accuracy, recall, and F1, and yields a stable clustering result.

Second, to address the requirement that K be specified before the algorithm runs, a method based on the validity of the clustering result is proposed. K is determined by analyzing the cohesion and separation of the clusters. Experimental results show that this method automatically discovers the correct K value, that is, the number of categories in the text data set.

Third, a method named TF-ICF is proposed for extracting the characteristic words of text clusters. Characteristic words are ranked by their weight within a text cluster, and those with high weight are extracted. Experimental results show that this method extracts valid characteristic words from text clusters.

Finally, a text clustering system based on the proposed K-Means algorithm is designed and implemented.
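
The Maximum-Minimum selection of initial cluster centers mentioned in the abstract could be sketched roughly as follows. This is only an illustration under assumptions, not the thesis's implementation: the abstract does not give the distance measure or the rule for picking the first center, so the sketch assumes cosine distance on dense document vectors and starts from an arbitrary document. The names `cosine_distance` and `max_min_centers` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def max_min_centers(vectors, k, first=0):
    """Pick k initial centers by the maximum-minimum distance rule.

    Starts from vectors[first] (an assumed, arbitrary choice), then repeatedly
    adds the vector whose distance to its nearest chosen center is largest.
    Returns the indices of the chosen vectors.
    """
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    chosen = [first]
    # nearest[i] = distance from vector i to its closest chosen center so far
    nearest = [cosine_distance(v, vectors[first]) for v in vectors]
    while len(chosen) < k:
        candidate = max(range(len(vectors)), key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], cosine_distance(v, vectors[candidate]))
    return chosen


# Toy document vectors (illustrative only): three clearly separated groups.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
centers = max_min_centers(docs, k=3)
print(centers)  # prints [0, 2, 4]
```

Because the rule is deterministic once the first center is fixed, repeated runs start K-Means from the same centers, which is consistent with the stable clustering result the abstract reports.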
Keywords/Search Tags: Text Clustering, K-Means, Accurate K Value, Characteristic Word Extraction
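
The TF-ICF weighting named in the abstract is not defined there, so the following is a hedged sketch that assumes it mirrors TF-IDF at the cluster level: term frequency within a cluster multiplied by the log of the inverse cluster frequency. Both formulas, the function name `tf_icf`, and the English toy tokens are assumptions for illustration; the thesis may define the weight differently.

```python
import math
from collections import Counter


def tf_icf(clusters, top_n=2):
    """Rank words in each cluster by an assumed TF * ICF weight.

    clusters: list of clusters, each a list of tokenized documents.
    TF  = occurrences of the word in the cluster / total words in the cluster.
    ICF = log(number of clusters / number of clusters containing the word).
    Returns, per cluster, the top_n (word, weight) pairs by descending weight.
    """
    counts = [Counter(w for doc in cluster for w in doc) for cluster in clusters]
    # Iterating a Counter yields each distinct word once, so this counts
    # how many clusters contain each word.
    cluster_freq = Counter(w for c in counts for w in c)
    n = len(clusters)
    ranked = []
    for c in counts:
        total = sum(c.values())
        weights = {
            w: (f / total) * math.log(n / cluster_freq[w]) for w, f in c.items()
        }
        # Sort by weight (descending), breaking ties alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ranked.append(top)
    return ranked


# Toy clusters of already-tokenized documents (illustrative only).
clusters = [
    [["football", "match", "goal"], ["football", "team", "match"]],
    [["stock", "market", "price"], ["stock", "price", "match"]],
]
top_words = [[word for word, _ in cluster] for cluster in tf_icf(clusters)]
print(top_words)
```

In this toy input, "match" appears in both clusters, so its assumed ICF is log(1) = 0 and it is never ranked as characteristic, which is the behavior one would want from a cluster-level analogue of IDF.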