Research Of Text Clustering Based On Genetic Algorithm

Posted on: 2010-07-13  Degree: Master  Type: Thesis
Country: China  Candidate: L Yang  Full Text: PDF
GTID: 2178330338976288  Subject: Computer applications
Abstract/Summary:
Text clustering, one of the most important research branches of clustering, is the application of clustering algorithms to text processing. Given the massive volume and high dimensionality of text data, building effective algorithms for text clustering is one of the research directions of data mining.

Text data are distinctive: their unstructured form makes them high-dimensional and sparse, and synonymy and polysemy are phenomena specific to natural-language text. These problems give text clustering a high time complexity and interfere with the accuracy of the clustering algorithm, causing a sharp decline in clustering performance.

First, this paper combines latent semantic indexing with a genetic algorithm to address these problems. In latent semantic indexing, singular value decomposition transforms the original feature space into a smaller latent semantic space, which removes much of the variation caused by differing word usage and arbitrary choice of expression. Feature selection optimized by a genetic algorithm then reduces the dimension of the feature vectors further, without requiring prior knowledge, thereby lowering the clustering complexity.

Second, for the clustering step, this paper presents a variable-length chromosome genetic algorithm based on the K-center clustering algorithm. Because the K-means algorithm is sensitive to outliers, the basic K-center algorithm is adopted instead. The K-center algorithm also requires K to be fixed in advance, and the clustering result depends heavily on the value of K. With variable-length chromosome encoding, the genetic clustering algorithm is not constrained by the quality of the initial population.

Last, simulation results show that dimension reduction optimized by the genetic algorithm is advantageous, and comparative experiments show the effectiveness of the improved algorithm proposed in this paper, leading to the conclusion that it outperforms the other algorithms it was compared against.
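The idea of a variable-length chromosome for clustering can be illustrated with a small sketch. This is not the thesis's algorithm, whose operators and fitness function are not given in the abstract. It assumes that "K-center" refers to a medoid-based method in the K-medoids style, that each chromosome is a list of medoid indices whose length is the candidate K, and that fitness is the mean silhouette width. The function names (`assign`, `silhouette`, `crossover`, `mutate`, `evolve`), the operators, and the 2-D toy points (standing in for reduced text vectors) are all hypothetical choices made for illustration.

```python
import math
import random

def assign(points, medoids):
    """Label each point with the position of its nearest medoid in `medoids`."""
    return [min(range(len(medoids)),
                key=lambda j: math.dist(p, points[medoids[j]]))
            for p in points]

def silhouette(points, medoids):
    """Mean silhouette width of the partition induced by the medoids."""
    labels = assign(points, medoids)
    clusters = [[p for p, l in zip(points, labels) if l == j]
                for j in range(len(medoids))]
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a point in a singleton cluster scores 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in c) / len(c)
                for j, c in enumerate(clusters) if j != l and c)
        total += (b - a) / max(a, b)
    return total / len(points)

def crossover(a, b, k_min, k_max, rng):
    """Child draws a random number of medoids from the pooled parent genes."""
    pool = sorted(set(a) | set(b))
    size = rng.randint(k_min, min(k_max, len(pool)))
    return rng.sample(pool, size)

def mutate(chrom, n_points, k_min, k_max, rng):
    """Add, drop, or swap one medoid; falls back to a swap when the
    chosen operation would violate the length bounds."""
    chrom = list(chrom)
    unused = [i for i in range(n_points) if i not in chrom]
    op = rng.choice(["add", "drop", "swap"])
    if op == "add" and len(chrom) < k_max and unused:
        chrom.append(rng.choice(unused))
    elif op == "drop" and len(chrom) > k_min:
        chrom.remove(rng.choice(chrom))
    elif unused:
        chrom[rng.randrange(len(chrom))] = rng.choice(unused)
    return chrom

def evolve(points, k_min=2, k_max=6, pop_size=30, generations=60, seed=0):
    """Search over medoid sets of varying size; returns the best chromosome."""
    rng = random.Random(seed)
    n = len(points)
    cache = {}

    def fitness(chrom):
        key = frozenset(chrom)
        if key not in cache:
            cache[key] = silhouette(points, chrom)
        return cache[key]

    pop = [rng.sample(range(n), rng.randint(k_min, k_max))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best chromosomes forward
        while len(nxt) < pop_size:
            # tournament selection (size 3) for each parent
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = crossover(p1, p2, k_min, k_max, rng)
            if rng.random() < 0.3:
                child = mutate(child, n, k_min, k_max, rng)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Toy data: three well-separated groups of five points each (made up).
centers = [(0, 0), (10, 0), (5, 9)]
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

best = evolve(points)
print(len(best), sorted(best))
```

Because chromosome length is part of what evolves, K is searched rather than supplied, and the add, drop, and swap mutations let the population move away from a poor initial draw. A real text-clustering setup would replace the Euclidean distance on 2-D points with a distance on the reduced document vectors.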
Keywords/Search Tags: text clustering, feature selection, latent semantic indexing, GA, K-center algorithm, improved K-center algorithm