A Study On Chinese Text Clustering Based On Ant Colony Algorithm

Posted on: 2010-02-03
Degree: Master
Type: Thesis
Country: China
Candidate: J Shen
Full Text: PDF
GTID: 2178330338975894
Subject: Computer software and theory
Abstract/Summary:
The amount of information on the Internet, from sources such as news, e-books and other online material, is increasing quickly. Making efficient use of such a huge collection of digital information has become an important problem. Supervised classification methods have been introduced to organize this information, but they share an intrinsic disadvantage: they need human intervention to obtain good results. To overcome this shortcoming, this thesis focuses on applying unsupervised clustering methods from text mining to large-scale text data sets.

First, we present a brief review of the development of text clustering technology, together with investigations of Chinese text pre-processing and text clustering algorithms. The former covers word segmentation, text feature extraction, text similarity measurement, etc. The latter compares and analyzes existing text clustering algorithms, including but not limited to K-means clustering, agglomerative clustering, density-based clustering, genetic-algorithm-based clustering and ant-colony-based clustering. Because of the particular requirements of clustering Chinese text, this thesis uses a multi-level partitioned word library for fast string-matching-based word segmentation, and adopts a compressed storage model for the text information throughout the clustering procedure.

Second, we study in depth text clustering based on the ant colony algorithm and the agglomerative algorithm, and propose a text clustering method that combines the two, which further improves the efficiency of the clustering process. The following methods are used to overcome the disadvantages of the ant-colony-based clustering algorithm. A compact algorithm is adopted when an ant's dropping position is not connected to other objects. We also propose an evaluation-function-based method for picking up objects, to avoid selecting objects randomly. To solve the stop-condition problem and obtain a more concise stop condition, the user-defined expected number of clusters is combined with the intra-cluster and inter-cluster distances in the stop decision. A dynamic parameter replaces the static ant dropping threshold, which reduces the complexity of the random formula calculation in the basic ant-colony-based algorithm. To accelerate convergence, we integrate the agglomerative algorithm into the ant-colony-based clustering framework, exploiting its fast processing.

Finally, standard data sets and real text sets are used to test the combined clustering algorithm based on the ant colony and agglomerative algorithms. F-measure and clustering time are used to analyze the clustering results and evaluate the proposed algorithm. The experimental results show that the proposed hybrid clustering algorithm has an obvious advantage when applied to large-scale text clustering problems.
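The abstract names F-measure as an evaluation criterion but does not say which variant the thesis uses. As an illustration only, the sketch below assumes the class-weighted form common in text clustering work: for each true class, take the best F1 score over all clusters, then average those scores weighted by class size. The function name `clustering_f_measure` and its argument names are hypothetical and do not come from the thesis.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted clustering F-measure (assumed variant, see lead-in).

    true_labels: the known class of each document.
    cluster_ids: the cluster assigned to each document, in the same order.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    n = len(true_labels)
    if n == 0:
        return 0.0

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap.get((cls, clu), 0)
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
        # Weight each class by its share of the documents.
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    labels = ["sports", "sports", "finance", "finance"]
    print(clustering_f_measure(labels, [0, 0, 1, 1]))
    print(clustering_f_measure(labels, [0, 0, 0, 0]))
```

A score of 1.0 means every class is reproduced exactly by some cluster. Merging everything into a single cluster keeps recall at 1 but lowers precision, so the score drops.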
Keywords/Search Tags: text clustering, ant colony algorithm, agglomerative algorithm, text mining, Chinese information processing