
K-means Text Clustering Algorithm Based On Double Genetic Algorithm In Text Mining

Posted on: 2018-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: J Wen
GTID: 2348330542467834
Subject: Management Science and Engineering
Abstract/Summary:
The explosive growth of information resources is driving a boom in text mining technology, as more and more people want to extract useful information quickly from large volumes of data. Text clustering, a key technology in text mining, plays an important role in finding useful information in large collections of text. Studying text clustering algorithms, identifying their problems, and making targeted improvements has therefore become a research focus for many scholars in recent years. The k-means algorithm, the most classical text clustering algorithm, is widely used because of its simple operation, fast convergence, and other advantages. However, it has two significant shortcomings: it is sensitive to the number of clusters and to the choice of initial center points.

After reviewing a large body of literature and related theory on the current state of domestic and international research, this paper proposes a k-means text clustering algorithm based on a double genetic algorithm, the TCDGK algorithm, which exploits the strong optimization capability of genetic algorithms. The core idea of TCDGK is that an outer genetic algorithm controls the number of clusters and an inner genetic algorithm controls the initial center points, so that both random factors are optimized together. To improve the usability of the algorithm, a hierarchical coding strategy is adopted to reflect the different roles of the inner and outer control parameters: the outer layer uses binary coding, while the inner layer uses decimal coding. The clustering results are evaluated by DBG and DIG, and the concept of an H value is proposed and used as the fitness function of the genetic algorithm. The goal is for the algorithm to yield both the best number of clusters and the best initial center points when it terminates.

To assess the performance of TCDGK, the Iris and Glass data sets are used as test data and the results are compared with those of five other algorithms, with accuracy, recall, F value, and purity as the evaluation indexes. To verify the application of TCDGK in text mining, the Chinese corpus of Fudan University is used as the experimental data. In this experiment, the text data are segmented, stop words are removed, and feature words are extracted; clustering is then run on the preprocessed text. The results of the traditional k-means algorithm and the TCDGK algorithm are compared to demonstrate the validity and usability of TCDGK. Finally, the experimental process is demonstrated through a visual interface.
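
The two-level encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the offset k_min, the reading of "decimal coding" as integer indices of data points, and the fitness score are all assumptions made for illustration. In particular, the score below is a simple between-cluster/within-cluster ratio standing in for the H value, whose definition the abstract does not give. Selection, crossover, and mutation are omitted; the sketch only decodes one outer/inner chromosome pair and scores the resulting clustering.

```python
import math

def decode_k(bits, k_min=2):
    # Outer layer (binary coding): a bit list decoded to a cluster count.
    # The k_min offset is an assumption so that k is never below 2.
    return k_min + int("".join(str(b) for b in bits), 2)

def mean(cluster):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, centers, iters=20):
    # Standard Lloyd iterations starting from the given initial centers.
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

def fitness(centers, clusters):
    # Placeholder score, NOT the thesis's H value: mean distance between
    # centers divided by mean distance of points to their own center.
    pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    between = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    n = sum(len(c) for c in clusters)
    within = sum(math.dist(p, centers[j])
                 for j, c in enumerate(clusters) for p in c) / n
    return between / within if within > 0 else float("inf")

def evaluate(points, outer_bits, inner_genes):
    k = decode_k(outer_bits)
    # Inner layer (assumed decimal coding): the first k genes are indices
    # of the data points used as initial centers.
    init = [points[g] for g in inner_genes[:k]]
    centers, clusters = kmeans(points, init)
    return k, fitness(centers, clusters)

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
k, score = evaluate(points, outer_bits=[0, 0], inner_genes=[0, 3, 1, 4])
print(k, round(score, 2))
```

In a full double genetic algorithm, the outer population would evolve the bit strings and, for each candidate k, an inner population would evolve the center genes, with the fitness of the best inner individual fed back to the outer layer. That nesting is one plausible reading of the abstract rather than a confirmed detail.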
Keywords/Search Tags: Double genetic, K-means algorithm, Layered coding, Text clustering, Text mining