
Research Of Text Clustering Based On Self-Organizing Maps

Posted on: 2008-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: K G Luo
Full Text: PDF
GTID: 2178360245497893
Subject: Computer Science and Technology
Abstract/Summary:
Based on an analysis of several classical text clustering algorithms, and in view of the advantages of SOM (Self-Organizing Maps) such as topology preservation and noise tolerance, this thesis adopts SOM as the overall framework for text clustering. It studies the characteristics of SOM-based text clustering, the main problems that arise, and the corresponding solutions, and it further explores the application of text clustering to search engines through extensive experiments. The main objective is to develop a self-adaptive text clustering method and to improve existing clustering algorithms so that the results reflect the topic structure of the input documents. Building on an analysis of the performance and training methods of the Kohonen SOM network, the main work of this thesis is as follows.

First, to address the high dimensionality and semantic correlation of text data, a dynamic SOM clustering algorithm based on latent semantic indexing is implemented. The method reduces the dimensionality of the original document-term matrix by singular value decomposition, which yields better clustering results and faster clustering. Although the algorithm clusters text by statistical means, it combines language rules with statistics to obtain better natural language understanding.

Second, a dynamic SOM algorithm with incremental gradient descent is proposed for clustering large-scale document collections, in order to overcome the slow growth of other dynamic SOM networks. Compared with other dynamic SOM algorithms such as GHSOM, the output layer is kept smaller because only a suitable number of neurons is inserted, so the number of underutilized neurons is greatly reduced and the trained map faithfully represents the distribution of topics in the document collection. When clustering massive document collections, the computation cost is also reduced remarkably. Overused neurons can be split again to further optimize the result, yielding good clustering quality.

Third, a SOM clustering algorithm based on the sparse character of document vectors is presented. The algorithm first scans all document vectors twice, forward and backward, to initialize the number of neurons and their characteristic terms, and it caps the number of non-zero dimensions at a constant. During training, the network merges similar neurons or inserts new ones as necessary until the training cycle ends. Compared with GHSOM, this method achieves better clustering results while greatly reducing computation cost and significantly lowering space complexity. The larger the input text collection, the sparser the neuron and document vectors become and the better the algorithm performs.

Finally, this thesis explores the application of large-scale text clustering to search engines in order to gain a deeper understanding of clustering-based search. It introduces the basic theory of clustering search engines and the evaluation criteria for a good clustering search engine, and it presents the design of a simple clustering search system that solves the problem of describing cluster categories in search results and greatly improves clustering speed.
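The first contribution combines latent semantic indexing with a dynamic SOM. As a rough illustration of the LSI step only (the dynamic SOM of the thesis is not reproduced here), the following Python sketch builds a TF-IDF document-term matrix, reduces it with truncated SVD, and trains a small fixed-size SOM on the reduced vectors; the toy corpus, map size, and learning schedule are illustrative assumptions, not the thesis's settings.

```python
# Minimal sketch: LSI (truncated SVD) dimension reduction followed by a
# basic fixed-size SOM. Corpus, map size, and learning-rate schedule are
# illustrative assumptions, not the settings used in the thesis.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "self organizing maps for text clustering",
    "latent semantic indexing reduces term dimensionality",
    "search engines group results into clusters",
    "singular value decomposition of the document term matrix",
]

# Document-term matrix (TF-IDF weighted), then LSI via truncated SVD.
tfidf = TfidfVectorizer().fit_transform(docs)           # shape: (n_docs, n_terms)
lsi = TruncatedSVD(n_components=2, random_state=0)
X = lsi.fit_transform(tfidf)                            # shape: (n_docs, 2)

# A small 3x3 SOM trained with the classic online update rule.
rows, cols, dim = 3, 3, X.shape[1]
rng = np.random.default_rng(0)
weights = rng.normal(size=(rows, cols, dim))
grid = np.array([[i, j] for i in range(rows) for j in range(cols)]).reshape(rows, cols, 2)

epochs = 50
for epoch in range(epochs):
    lr = 0.5 * (1 - epoch / epochs)            # decaying learning rate
    sigma = 1.5 * (1 - epoch / epochs) + 0.1   # decaying neighbourhood radius
    for x in X:
        # Best-matching unit: neuron whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighbourhood around the BMU on the map grid.
        grid_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)

# Assign each document to its best-matching neuron (its cluster).
labels = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                           (rows, cols)) for x in X]
print(labels)
```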
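The second contribution grows the map incrementally, inserting neurons where they are needed and splitting overused ones. The abstract does not give the exact growth and splitting rules, so the snippet below is only a generic, topology-free sketch of error-driven neuron insertion with invented thresholds, not the thesis's algorithm.

```python
# Generic sketch of error-driven neuron insertion for a growing prototype
# set. NOT the thesis's algorithm: the growth criterion, threshold, and
# the absence of a map topology are simplifications.
import numpy as np

def grow_prototypes(X, n_start=2, epochs=20, lr=0.3, split_error=5.0, seed=0):
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(len(X), n_start, replace=False)].copy()
    for _ in range(epochs):
        errors = np.zeros(len(protos))        # accumulated quantization error
        for x in X:
            d = np.linalg.norm(protos - x, axis=1)
            w = int(np.argmin(d))             # winning neuron
            errors[w] += d[w]
            protos[w] += lr * (x - protos[w]) # online SOM-style update
        worst = int(np.argmax(errors))
        if errors[worst] > split_error:
            # Insert a new neuron near the worst-quantizing one
            # (a crude stand-in for splitting an overused neuron).
            jitter = 0.05 * rng.normal(size=protos.shape[1])
            protos = np.vstack([protos, protos[worst] + jitter])
    return protos

# Toy data with three well-separated groups of points.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc=c, scale=0.2, size=(30, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
print(grow_prototypes(X).shape)   # the number of neurons grows with the data
```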
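The third contribution exploits the sparsity of document vectors by fixing the number of non-zero dimensions kept per neuron. The sketch below only illustrates that idea, truncating a neuron's sparse weight vector to its k largest components after each update; the dictionary representation and the value of k are assumptions, not the thesis's design.

```python
# Sketch of keeping neuron weight vectors sparse by retaining only their
# k largest components after each update. The dict representation and the
# choice of k are illustrative assumptions.
from collections import defaultdict

def truncate(weights, k):
    """Keep only the k largest-magnitude dimensions of a sparse vector."""
    top = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(top)

def update_neuron(neuron, doc, lr, k):
    """Move a sparse neuron toward a sparse document vector, then re-truncate."""
    merged = defaultdict(float, neuron)
    for term, value in doc.items():
        merged[term] += lr * (value - merged[term])
    for term in list(merged):
        if term not in doc:                 # dimensions absent from the document
            merged[term] *= (1 - lr)        # decay toward zero
    return truncate(merged, k)

# Documents and neurons as sparse {term_id: weight} maps.
doc = {3: 0.7, 17: 0.5, 42: 0.2}
neuron = {3: 0.1, 8: 0.4, 99: 0.05, 17: 0.3}
print(update_neuron(neuron, doc, lr=0.5, k=3))
```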
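The final part concerns presenting clusters in a search setting, which requires a readable description for each cluster category. The abstract does not say how the thesis generates these descriptions; as a plain illustration of one common approach, the sketch below labels a cluster with the highest-weight terms of its neuron, using an invented vocabulary and weights.

```python
# Illustrative only: label a cluster by its neuron's highest-weight terms.
# The vocabulary and weights are invented; the thesis's actual labelling
# method is not described in the abstract.
def cluster_label(neuron_weights, vocabulary, top_n=3):
    """Return the top_n terms with the largest weights as the cluster label."""
    ranked = sorted(range(len(neuron_weights)),
                    key=lambda i: neuron_weights[i], reverse=True)
    return ", ".join(vocabulary[i] for i in ranked[:top_n])

vocabulary = ["som", "clustering", "search", "engine", "semantic", "index"]
neuron_weights = [0.10, 0.65, 0.40, 0.35, 0.05, 0.02]
print(cluster_label(neuron_weights, vocabulary))   # clustering, search, engine
```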
Keywords/Search Tags:Self-Organizing Maps, text clustering, the sparse character of vectors, Latent Semantic Indexing, clustering search