Document Clustering Technology Based On Genetic Algorithm

Posted on: 2007-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: B Le
Full Text: PDF
GTID: 2178360185972807
Subject: Computer software and theory
Abstract/Summary:
Document clustering is one of the most important research topics in information retrieval (IR) and data mining (DM). Clustering, an unsupervised classification method, is the process of grouping similar data into a number of clusters. Hierarchical clustering methods cluster accurately but require a great deal of computation, especially when the samples are numerous and high-dimensional. Dynamic clustering methods, inspired by iteration in mathematics, were developed to reduce this computation. However, the behavior of a dynamic clustering method is sensitive to its parameter settings, which should be chosen by analyzing the physical meaning of the samples. When the samples are numerous and high-dimensional, setting the parameters therefore becomes very difficult, and they can be chosen only through extensive experiments. In addition, the initial corpus and the objective function are separate, and the objective function may have several local extrema, yet the algorithms in use have no mechanism for avoiding a poor result. As a consequence, the clustering results are sensitive to the initial cluster centers and to the order in which the samples are input.
Genetic algorithms, motivated by natural evolution, use evolutionary operators and a population of solutions to search for the globally optimal partition of the data. This paper introduces a genetic algorithm to address two problems of some dynamic clustering algorithms: sensitivity to the initial solution, and convergence of the cluster centers to local extrema. The cluster centers are encoded in binary, the sum of the Euclidean distances between the points and their respective centers is adopted as the fitness function, and results are obtained through selection, crossover and mutation. The experimental results on Reuters-21578 Top 10 show that: 1) the algorithm obtains good results effectively; 2) the clustering criterion function (Purity), defined over the entire clustering solution, is good.
However, how to apply the method to Chinese corpora, and how the algorithm behaves on higher-dimensional data, remain to be resolved in future research.
The main contributions of this paper are as follows:
1) A new algorithm for document clustering is given;
2) The algorithm is verified and analyzed on real datasets.
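The genetic clustering scheme described in the abstract (binary-encoded cluster centers, a fitness based on the sum of Euclidean distances from points to their centers, then selection, crossover and mutation) can be illustrated with a minimal sketch. This is not the thesis's implementation. The 8-bit resolution, tournament selection, single-point crossover, bit-flip mutation, elitism, all parameter values, and the names `decode`, `cost`, `tournament` and `ga_cluster` are assumptions made for illustration, and the toy 2-D points stand in for document vectors.

```python
import math
import random

BITS = 8  # bits per coordinate (assumed resolution, not from the thesis)

def decode(chrom, k, dim, lo, hi):
    """Turn a bit list into k centers with coordinates scaled into [lo, hi]."""
    centers = []
    for c in range(k):
        center = []
        for d in range(dim):
            start = (c * dim + d) * BITS
            value = int("".join(map(str, chrom[start:start + BITS])), 2)
            center.append(lo + (hi - lo) * value / (2 ** BITS - 1))
        centers.append(center)
    return centers

def cost(chrom, points, k, lo, hi):
    """Sum of Euclidean distances from each point to its nearest center (lower is fitter)."""
    centers = decode(chrom, k, len(points[0]), lo, hi)
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def tournament(pop, costs, rng, size=3):
    """Selection (assumed variant): lowest-cost chromosome among `size` random candidates."""
    picks = rng.sample(range(len(pop)), size)
    return pop[min(picks, key=lambda i: costs[i])]

def ga_cluster(points, k, generations=60, pop_size=30,
               p_cross=0.8, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    lo = min(min(p) for p in points)
    hi = max(max(p) for p in points)
    length = k * dim * BITS
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best cost seen in each generation
    for _ in range(generations):
        costs = [cost(ch, points, k, lo, hi) for ch in pop]
        best = min(range(pop_size), key=lambda i: costs[i])
        history.append(costs[best])
        nxt = [pop[best][:]]  # elitism: carry the best chromosome over unchanged
        while len(nxt) < pop_size:
            a = tournament(pop, costs, rng)
            b = tournament(pop, costs, rng)
            if rng.random() < p_cross:
                cut = rng.randrange(1, length)  # single-point crossover
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # bit-flip mutation, applied independently to every bit
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
    costs = [cost(ch, points, k, lo, hi) for ch in pop]
    best = min(range(pop_size), key=lambda i: costs[i])
    history.append(costs[best])
    return decode(pop[best], k, dim, lo, hi), history

# Toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, history = ga_cluster(points, k=2)
print(centers, history[-1])
```

Because the best chromosome is always copied into the next generation, the best cost recorded in `history` never increases. Because the population samples many candidate center sets at once, a single poor starting point does not by itself determine the outcome.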
Keywords/Search Tags: Document clustering, Genetic Algorithm, Dimension Reduction, Latent Semantic Index, Purity
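Purity, listed among the keywords and named in the abstract as the clustering criterion, is commonly computed by assigning each cluster its majority class and dividing the number of correctly covered documents by the total. The sketch below follows that common definition, since the abstract does not give the exact formula used in the thesis. The function name `purity` and the sample labels are illustrative assumptions.

```python
from collections import Counter

def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # For each cluster, count the items of its most frequent class.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)

# Illustrative labels only; each cluster has two majority items out of three.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["earn", "earn", "acq", "acq", "acq", "earn"]
score = purity(clusters, labels)
print(score)
```

Purity ranges from just above 0 to 1, and a value of 1 means every cluster contains documents of a single class. It tends to rise as the number of clusters grows, so it is most informative when comparing solutions with the same number of clusters.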