Research On The Incremental Text Clustering Based On Cluster Features

Posted on: 2013-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: M Pan
Full Text: PDF
GTID: 2298330377959819
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of network and computer technology, the Internet has become a major source of information. As the volume of online information keeps growing, it is increasingly difficult to find the information we actually want in large-scale data using traditional methods. How to organize and manage this information effectively has therefore become an urgent problem. Text clustering is an effective tool for organizing and managing textual information, and it can discover potentially useful patterns in large-scale data.

However, traditional clustering algorithms have high time complexity on large-scale text data sets, and whenever the data are updated the entire data set must be re-clustered, which greatly reduces clustering efficiency. To address these problems, incremental updating is desirable. An incremental clustering algorithm builds on the existing clustering results and processes new data one item at a time or in batches, which reduces time complexity and improves efficiency. How an incremental algorithm can achieve the same clustering quality as a traditional algorithm, however, remains a problem worth studying.

The paper presents an incremental text clustering algorithm based on cluster features. The algorithm consists of two stages: initial clustering and incremental clustering. In the initial stage, the simple and efficient k-means algorithm is used to cluster the documents, and for each cluster the cluster center, mean, variance, number of documents, third central moment and fourth central moment are retained as its cluster features. When new documents arrive, the algorithm enters the incremental stage. First, a score is calculated between each new document and each cluster obtained in the initial stage.
To further improve clustering accuracy, the score between a new document and an existing cluster is calculated from both the similarity value and the Euclidean distance. The document is then placed in the cluster with the highest score, and that cluster's features are updated. Finally, the cluster to which the document belongs is determined according to the change in the cluster features before and after the update. With this method, the entire data set no longer needs to be re-clustered.

The main contributions of the paper are as follows:

1. An incremental text clustering algorithm based on cluster features is proposed, and its results are compared with those of a non-incremental (traditional) text clustering algorithm. Experimental results on the 20newsgroups data set show that the proposed algorithm has higher purity and lower time complexity, and achieves a better clustering effect than the traditional algorithm. A comparison with a recently proposed incremental text clustering algorithm shows that the proposed algorithm has some advantages.

2. The similarity value and the Euclidean distance are combined to calculate the score between a new document and an existing cluster, and the change in cluster features is used to decide the final cluster to which the document belongs. The experimental results also show that this method effectively improves the clustering effect.
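
The incremental stage described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several points are assumptions, because the abstract does not give the formulas: the moments are tracked over each member's Euclidean distance to the cluster center, the score is a hypothetical weighted blend `alpha * cosine + (1 - alpha) / (1 + distance)`, and the final check on the change in cluster features is omitted. The names `RunningMoments`, `Cluster`, `score` and `assign` are invented for illustration. The one-pass moment update is the standard streaming formula for the sums of second, third and fourth powers of deviations.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RunningMoments:
    """One-pass count, mean and central-moment sums of a scalar stream."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of (x - mean)^2
    m3: float = 0.0  # sum of (x - mean)^3
    m4: float = 0.0  # sum of (x - mean)^4

    def add(self, x):
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # m4 and m3 must be updated before m2, since they use the old m2 and m3.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def central_moment(self, order):
        """Population central moment of order 2, 3 or 4 (order 2 is the variance)."""
        total = {2: self.m2, 3: self.m3, 4: self.m4}[order]
        return total / self.n if self.n else 0.0


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def score(doc, center, alpha=0.5):
    # Assumed combination; the abstract only says both quantities are used.
    return alpha * cosine(doc, center) + (1 - alpha) / (1 + euclidean(doc, center))


@dataclass
class Cluster:
    center: list
    stats: RunningMoments = field(default_factory=RunningMoments)

    @classmethod
    def from_documents(cls, docs):
        """Build cluster features from the members found by an initial clustering."""
        dim = len(docs[0])
        center = [sum(d[i] for d in docs) / len(docs) for i in range(dim)]
        cluster = cls(center=center)
        for d in docs:
            cluster.stats.add(euclidean(d, center))
        return cluster

    def add(self, doc):
        # Record the distance to the current center, then move the center.
        self.stats.add(euclidean(doc, self.center))
        n = self.stats.n
        self.center = [c + (x - c) / n for c, x in zip(self.center, doc)]


def assign(doc, clusters, alpha=0.5):
    """Put doc into the highest-scoring cluster and return that cluster's index."""
    best = max(range(len(clusters)),
               key=lambda i: score(doc, clusters[i].center, alpha))
    clusters[best].add(doc)
    return best
```

In this sketch a new document touches only the features of the cluster it joins, so the cost per document depends on the number of clusters, not on the number of documents already clustered. In practice the document vectors would typically be TF-IDF weights, and the initial clusters would come from a k-means run.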
Keywords/Search Tags: incremental clustering, k-means, text clustering, central moment, cluster features