An Incremental Text Clustering Algorithm Based On Cluster Cohesion

Posted on: 2014-11-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Tao
GTID: 2268330401488306
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, more and more data appears online, and people urgently need useful tools to organize and manage text information. Text clustering is one such technology. Traditional methods, however, process all objects at the same time: if data needs to be updated or added, they have to re-cluster the entire data set. This mode is suitable only for static data, not for dynamic situations, and its time complexity is very high. Incremental clustering is better suited to dynamic situations because it extends an existing clustering result. When the target data is constantly updated or growing, it can avoid many repeated calculations, reduce processing time, and ultimately improve the clustering result.

This thesis presents a new incremental text clustering algorithm based on cluster cohesion. First, we use WordNet to calculate the semantic similarity between every pair of lexical items. When counting the occurrence frequency of a word, we add the occurrence frequencies of other words with similar meaning, so that frequency weights can be assigned to words more accurately. Next, we calculate the cohesion between each new text and every existing cluster. The cohesion measure considers not only the similarity between the text and each cluster's center, but also the similarity between clusters. The algorithm adds each text to the cluster with the highest cohesion score and then updates that cluster's center, mean, variance, and other feature information. To further improve clustering performance, whenever an incremental pass is finished, we reassign the texts whose category is uncertain in the same manner. If some texts still cannot be placed with certainty, we add them to the clusters with the largest cohesion, but at this stage we do not change the center or other feature information of those clusters. This avoids the bias that misclassified texts would otherwise introduce.

The main work of this thesis is as follows:

1. We use a document model based on word semantic similarity. It counts not only the occurrence frequency of each word but also the occurrence frequencies of similar words, determined from the WordNet glosses of the two words, so that frequency weights can be allocated to lexical items more accurately.
2. We propose a new incremental text clustering algorithm based on cluster cohesion, using a new method to calculate the cohesion between a text and a cluster.

The algorithm is evaluated experimentally on the 20 Newsgroups data set and compared with k-means, a classic clustering algorithm, and with a recently proposed incremental clustering algorithm based on the similarity histogram. The evaluation measures are purity, entropy, and normalized mutual information. We also analyze the impact of each threshold value. The experimental results show that, on all three evaluation measures, the overall performance of the proposed algorithm is superior to that of the comparison algorithms, and that computation time is also greatly reduced.
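The incremental assignment step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the cohesion formula, so the `cohesion` function here is a hypothetical stand-in: cosine similarity between the text and the cluster center, minus a penalty for how similar the updated center would become to the other clusters' centers. The names `Cluster`, `cohesion`, and `assign`, the weight `lam`, and the `threshold` that opens a new cluster are all assumptions made for illustration. The WordNet-based weighting and the deferred reassignment of uncertain texts are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class Cluster:
    """Keeps only summary features (count and vector sum), not the members."""

    def __init__(self, vec):
        self.n = 1
        self.total = list(vec)

    def center(self, extra=None):
        """Current center, or the center the cluster would have with `extra` added."""
        if extra is None:
            return [t / self.n for t in self.total]
        return [(t + x) / (self.n + 1) for t, x in zip(self.total, extra)]

    def add(self, vec):
        self.total = [t + x for t, x in zip(self.total, vec)]
        self.n += 1


def cohesion(vec, idx, clusters, lam=0.5):
    """Hypothetical cohesion score (an assumption, not the thesis's formula)."""
    target = clusters[idx]
    updated = target.center(vec)
    others = [c.center() for j, c in enumerate(clusters) if j != idx]
    overlap = sum(cosine(updated, o) for o in others) / len(others) if others else 0.0
    return cosine(vec, target.center()) - lam * overlap


def assign(vec, clusters, threshold=0.3):
    """Add `vec` to the most cohesive cluster, or open a new one (assumed rule)."""
    scores = [cohesion(vec, i, clusters) for i in range(len(clusters))]
    best = max(range(len(scores)), key=scores.__getitem__, default=None)
    if best is None or scores[best] < threshold:
        clusters.append(Cluster(vec))
        return len(clusters) - 1
    clusters[best].add(vec)
    return best


# Toy term-weight vectors arriving one at a time.
docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1], [1, 0.05, 0]]
clusters = []
labels = [assign(d, clusters) for d in docs]
```

Because each cluster stores only a count and a vector sum, adding a text costs time proportional to the number of clusters rather than the number of texts already processed, which is the property that makes incremental clustering attractive for growing data sets.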
Keywords/Search Tags: incremental clustering, cluster features, semantic similarity, cohesion