On Improving Ant-based Text Clustering Algorithm

Posted on: 2007-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: S G Wang
Full Text: PDF
GTID: 2178360182984071
Subject: Systems analysis and integration
Abstract/Summary:
Data clustering is an old but challenging research topic. With the rapid growth of textual information on the Internet, the need to extract information from huge amounts of text is growing as well, and research on text clustering has consequently received great attention. In recent years, inspired by the corpse-clustering and larval-sorting activities observed in real ant colonies, ant-based clustering algorithms have been introduced, following the pioneering work of Deneubourg et al. Combining ant clustering with text clustering leads to ant-based text clustering algorithms.

In this thesis, the standard ant clustering algorithm and some typical variants are analyzed. The conclusion is that the performance of these algorithms is unsatisfactory in various situations. Their limitations fall into two areas. On the one hand, the thesis argues that the ants' behavior patterns can be modified to achieve better algorithmic performance. For example, the standard ant clustering algorithm introduces too many random factors into ant activity, which is likely to hamper convergence. Moreover, the ants' status cannot be adjusted in response to changes in their local environment. On the other hand, the similarity measurement in current ant clustering algorithms is not accurate enough. Because they adopt the VSM-based keyword matching method, these algorithms largely neglect the semantic connections among words.

To address these two issues, this thesis develops an extended ant-based text clustering algorithm intended to improve both algorithmic performance and usefulness in real applications. The major contributions are as follows:

1. Extension of the ant clustering algorithm. The ant behavior scheme is redesigned by adding two caches so that the ants can pick up and drop data items more effectively. The picking-up and dropping thresholds, as well as the neighborhood range, are made adjustable to further help the algorithm converge efficiently and precisely.

2. Semantic-based text similarity measurement. An ontology-based text similarity measurement is introduced to improve the precision of the clustering, and a WordNet-based implementation is further discussed in the thesis.

Finally, the proposed algorithm is implemented, and experiments are conducted on 50 documents excerpted from the Reuters-21578 standard corpus, comparing its clustering results with those of the standard ant clustering algorithm and the standard k-means algorithm. In terms of clustering precision and recall, the experiments indicate that the proposed algorithm outperforms both the standard ant clustering algorithm and the k-means algorithm.
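As background for the two limitations discussed above, the following is a minimal Python sketch of the picking-up and dropping rules in a standard (not the thesis's extended) ant clustering algorithm, using keyword-matching cosine similarity on term-frequency vectors. The abstract gives no formulas, so the forms below are the ones commonly cited for Deneubourg-style ant clustering with a Lumer-Faieta-style neighborhood density. The function names and the constants `k1`, `k2`, `alpha`, and `cells` are illustrative assumptions, not values from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Keyword-matching similarity on raw term-frequency vectors (VSM)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def neighborhood_density(doc, neighbors, alpha=0.5, cells=9):
    """Average similarity of `doc` to the items in its local neighborhood.

    `alpha` scales dissimilarity and `cells` is the neighborhood area;
    both values are assumed here for illustration only.
    """
    total = sum(1.0 - (1.0 - cosine_similarity(doc, n)) / alpha
                for n in neighbors)
    return max(0.0, total / cells)


def pick_probability(f, k1=0.1):
    """An unladen ant is likely to pick up an item in a sparse, dissimilar area."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.15):
    """A laden ant is likely to drop its item among similar items."""
    return (f / (k2 + f)) ** 2


if __name__ == "__main__":
    doc = "oil prices rise sharply"
    similar = ["oil prices rise", "oil prices fall"]
    dissimilar = ["wheat harvest report", "bank interest rates"]
    for label, hood in (("similar", similar), ("dissimilar", dissimilar)):
        f = neighborhood_density(doc, hood)
        print(label, round(pick_probability(f), 3), round(drop_probability(f), 3))
    # Pure keyword matching scores synonyms as unrelated:
    print(cosine_similarity("car engine", "automobile motor"))
```

The last line illustrates the second limitation: two texts that share no surface keywords get a similarity of zero even when they are close in meaning, which is the gap a semantic (for example WordNet-based) measure is meant to close.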
Keywords/Search Tags: text clustering, ant clustering algorithm, semantic similarity, ontology, WordNet