Document Clustering Technology Based On Genetic Algorithm

Posted on:2007-01-06

Degree:Master

Type:Thesis

Country:China

Candidate:B Le

Full Text:PDF

GTID:2178360185972807

Subject:Computer software and theory

Abstract/Summary:

Document clustering is one of the most important research topics in information retrieval (IR) and data mining (DM). Clustering, an unsupervised classification method, is the process of grouping similar data into a number of clusters. Hierarchical clustering methods cluster accurately but require a great deal of computation, especially when the samples are numerous and high-dimensional. Dynamic clustering methods, inspired by iteration in mathematics, were devised to reduce this computation. The behavior of a dynamic clustering method, however, is sensitive to its parameter settings, and choosing them requires analyzing the physical meaning of the samples. When the samples are numerous and high-dimensional, the parameters become too difficult to set analytically and can be chosen only through extensive experimentation. In addition, the initial corpus and the objective function are decoupled, so the objective may have several local extrema, and the algorithms in use have no mechanism for avoiding a poor result. The clustering results are therefore sensitive to the initial cluster centers and to the order in which the samples are input.

Genetic algorithms, motivated by natural evolution, use evolutionary operators and a population of solutions to search for the globally optimal partition of the data. This paper introduces a genetic algorithm to address two weaknesses of some dynamic clustering algorithms: sensitivity to the initial solution, and cluster centers that converge to local extrema. In this paper, the cluster centers are binary-encoded, the sum of the Euclidean distances between the points and their respective centers is adopted as the fitness function, and results are obtained through selection, crossover, and mutation. The experimental results on Reuters-21578 Top 10 show that (1) the algorithm obtains good results effectively, and (2) the clustering criterion function (Purity), defined over the entire clustering solution, is good.

How to apply the method to Chinese corpora, and how the algorithm performs in higher dimensions, remain to be resolved in future research. The main contributions of this paper are as follows: (1) a new algorithm for document clustering is proposed; (2) the algorithm is verified and analyzed on real datasets.
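The scheme described in the abstract (binary-encoded cluster centers, a fitness based on the sum of Euclidean distances to the nearest center, and selection, crossover, and mutation) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not specify the operators or parameters, so the following are all assumptions made for illustration: tournament selection, one-point crossover, bit-flip mutation, elitism, the 8-bit resolution `BITS`, and every function name. Lower total distance is treated as fitter. The `purity` helper follows the usual definition of that criterion: the fraction of points that carry the majority label of their cluster.

```python
import math
import random

BITS = 8  # bits per coordinate; an assumed resolution, not taken from the thesis


def decode(chrom, k, dim, lo, hi):
    """Turn a bit list into k centers, each coordinate scaled into [lo, hi]."""
    centers = []
    for c in range(k):
        center = []
        for d in range(dim):
            start = (c * dim + d) * BITS
            value = int("".join(map(str, chrom[start:start + BITS])), 2)
            center.append(lo + (hi - lo) * value / (2 ** BITS - 1))
        centers.append(center)
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def total_distance(points, centers):
    """Sum of Euclidean distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def purity(labels, clusters):
    """Fraction of points that carry the majority label of their cluster."""
    counts = {}
    for label, cluster in zip(labels, clusters):
        per_cluster = counts.setdefault(cluster, {})
        per_cluster[label] = per_cluster.get(label, 0) + 1
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def ga_cluster(points, k, generations=60, pop_size=30, p_mut=0.02, seed=0):
    """Search for k centers with a simple GA (assumed operators, see lead-in)."""
    rng = random.Random(seed)
    dim = len(points[0])
    lo = min(min(p) for p in points)
    hi = max(max(p) for p in points)
    length = k * dim * BITS

    def cost(chrom):
        return total_distance(points, decode(chrom, k, dim, lo, hi))

    def select(pop):
        # Binary tournament: the cheaper of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if cost(a) < cost(b) else b

    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = min(pop, key=cost)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best individual forward unchanged
        while len(nxt) < pop_size:
            p1, p2 = select(pop), select(pop)
            cut = rng.randrange(1, length)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            nxt.append(child)
        pop = nxt
        best = min(pop, key=cost)
    return decode(best, k, dim, lo, hi)


if __name__ == "__main__":
    # Toy 2-D data standing in for document vectors (hypothetical values).
    pts = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (9.8, 9.9), (10.0, 9.5), (9.6, 10.0)]
    gold = ["a", "a", "a", "b", "b", "b"]
    centers = ga_cluster(pts, k=2)
    print(centers, purity(gold, assign(pts, centers)))
```

Because the search keeps a whole population of candidate center sets, it does not depend on a single initial guess, which is the weakness of dynamic clustering that the abstract highlights. Real document vectors would be far higher-dimensional than this toy data, and the chromosome length grows as k × dimension × `BITS`.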

Keywords/Search Tags:

Document clustering, Genetic Algorithm, Dimension Reduction, Latent Semantic Index, Purity

Related items

1	Latent Semantic Retrieval Based On Document Clustering Analysis
2	Researching The Application Of Latent Semantic Index To Chinese Document Clustering
3	Study And Implementation On Latent Semantic Space Analysis And Web Document Clustering Based On LDA
4	The Research Of Index Technology Based On Semantic Web Document
5	Research On Document Clustering Technology Based On Latent Semantic Indexing
6	Understanding Of Web-based Document Inverted Row Of Full-text Index Research And Realization
7	Research Of Clustering And Characteristic Dimension-reduction Based On Immune Genetic Algorithm
8	Latent Semantic Analysis-based Spam Filtering System Design And Realization
9	Unsupervised Clustering Algorithm Based On Dimension Reduction
10	Design And Implementation Of Clustering Analysis Algorithm Based On Dimension Reduction