
Research On Text Clustering Based On Text Dimension Reduction And Ant Colony Algorithm

Posted on: 2017-04-09; Degree: Master; Type: Thesis
Country: China; Candidate: H T Zhang
GTID: 2308330485963996; Subject: Computer application technology
Abstract/Summary:
As a branch of text mining, text clustering plays an increasingly important role. By grouping similar texts together, it makes it easier to extract latent, valuable information from large volumes of text. This thesis studies text clustering on the Fudan University Chinese text corpus, grouping similar texts by combining a text similarity measure with a suitable clustering algorithm. Because of the structural characteristics of Chinese text, the documents must be preprocessed before clustering, namely segmented into words and stripped of uninformative stop words. This thesis uses the Chinese Academy of Sciences (CAS) segmentation system (ICTCLAS) for word segmentation and the Harbin Institute of Technology stop word list for stop-word filtering. Feature words are then selected so that the retained words express the content of each text. Finally, because a computer cannot directly process unstructured or semi-structured information, each text is represented with the vector space model (VSM), and the term frequency-inverse document frequency (TF-IDF) value serves as the weight of each feature word. An analysis of the whole preprocessing pipeline shows that a vector space model built directly from the processed keywords suffers from high dimensionality and sparse data.
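The VSM/TF-IDF representation described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes the documents are already segmented and stop-word filtered, uses relative term frequency and an unsmoothed `log(N / df)` for IDF (one common variant; the abstract does not say which is used), and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of token lists.

    Assumes each document is already segmented and stop-word filtered.
    Weighting variant (an assumption): tf = count / doc length,
    idf = log(N / df).
    """
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical toy corpus (English tokens stand in for segmented Chinese).
docs = [
    ["text", "clustering", "ant"],
    ["text", "mining"],
    ["ant", "colony", "ant"],
]
vocab, vectors = tfidf_vectors(docs)
```

Even in this tiny corpus most vector entries are zero, which illustrates the sparsity problem the abstract describes: the vector length grows with the vocabulary while each document uses only a few terms.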
To address this problem, feature words are first selected in two steps. The first step uses the chi-square test to extract feature words and obtain a feature word set. The second step applies semantics-based hierarchical clustering to the feature words, merging synonyms and near-synonyms in the set. The TF-IDF value of each word in the resulting set is then calculated and the vector space model is generated. However, the model still has high dimensionality and sparse elements. This thesis therefore uses singular value decomposition (SVD) to find the latent semantic space of the vector space model, which reduces the dimension of the model and the interference of noise points. This approach preserves the original characteristics of the model while effectively reducing the dimension of the matrix, which improves the efficiency of text clustering. After dimension reduction, the next step is to select an appropriate text clustering algorithm. Many text clustering algorithms exist; by clustering method, they can be divided into partition-based, hierarchy-based, density-based, and model-based algorithms. Traditional text clustering algorithms have shortcomings such as requiring the number of clusters to be specified in advance and lacking self-organization. This thesis therefore adopts an ant colony text clustering algorithm as the final text clustering algorithm. An analysis of the basic ant colony algorithm shows that it has several shortcomings, for example, too many iterations and overly random movement of the ants on the two-dimensional plane, which affect both the convergence speed of the algorithm and the quality of the text clustering.
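The first selection step, the chi-square test, can be sketched as below. The sketch uses the standard 2x2 contingency form of the statistic; scoring each term by its maximum value over categories is an assumption (the abstract does not say how per-category scores are combined), and `chi_square`, `select_features`, and the toy data are hypothetical names and values used only for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    Assumption: a term's score is its maximum over all categories.
    """
    terms = sorted({t for doc in docs for t in doc})
    scores = {}
    for term in terms:
        best = 0.0
        for cat in sorted(set(labels)):
            pairs = list(zip(docs, labels))
            a = sum(1 for doc, y in pairs if y == cat and term in doc)
            b = sum(1 for doc, y in pairs if y != cat and term in doc)
            c = sum(1 for doc, y in pairs if y == cat and term not in doc)
            d = sum(1 for doc, y in pairs if y != cat and term not in doc)
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(terms, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Hypothetical labelled toy corpus.
docs = [
    ["goal", "match"],
    ["goal", "team"],
    ["stock", "market"],
    ["stock", "team"],
]
labels = ["sports", "sports", "finance", "finance"]
selected, scores = select_features(docs, labels, k=2)
```

Here "team" appears equally often in both categories, so its score is zero and it is dropped, while "goal" and "stock" each occur in only one category and rank highest.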
Building on the basic algorithm, this thesis adopts several improvements. First, the termination condition is modified so that the algorithm no longer relies solely on the maximum number of iterations, which avoids wasting time on meaningless iterations. Second, the observation radius of the ants is adjusted dynamically, decreasing linearly, so that convergence speed and clustering quality are balanced. Third, three strategies are formulated so that the ants move purposefully when picking up and putting down texts, which improves the clustering effect. Finally, worked examples demonstrate the text dimension reduction method and show its feasibility. Simulation experiments were carried out on the Fudan University Chinese text corpus. The experimental results show that the improved ant colony text clustering algorithm both accelerates the convergence of text clustering and improves the accuracy of the clustering results.
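Two of these improvements can be sketched in a few lines. The abstract gives no formulas, so everything below is an assumption: the radius is interpolated linearly between hypothetical start and end values, and the early-stop rule (stop once no ant has picked up or dropped a text for `patience` consecutive steps) is one plausible reading of "not relying solely on the maximum number of iterations", not the thesis's stated criterion.

```python
def observation_radius(iteration, max_iterations, r_start, r_end):
    """Linearly shrink the ants' observation radius from r_start to r_end.

    Hypothetical schedule: iteration 0 gives r_start, the last iteration
    (max_iterations - 1) gives r_end, with linear interpolation between.
    """
    if max_iterations <= 1:
        return float(r_end)
    # Clamp so out-of-range iteration values stay within the schedule.
    step = min(max(iteration, 0), max_iterations - 1)
    frac = step / (max_iterations - 1)
    return r_start + (r_end - r_start) * frac


def should_stop(idle_steps, iteration, max_iterations, patience):
    """Stop at the iteration cap, or earlier if the colony has stalled.

    idle_steps counts consecutive steps with no pick-up or drop action;
    the patience threshold is an assumed parameter.
    """
    return iteration >= max_iterations or idle_steps >= patience


# Radius schedule over 11 iterations, from 5.0 down to 1.0.
radii = [observation_radius(t, 11, 5.0, 1.0) for t in range(11)]
```

A large early radius lets ants judge similarity over a wide neighbourhood and form coarse groups quickly, while a small late radius refines cluster boundaries; the stall check ends the run once further iterations would change nothing.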
Keywords/Search Tags: text clustering, text dimension reduction, singular value decomposition, ant colony algorithm text clustering, vector space model