
Research On Document Clustering Algorithm Based On K-means

Posted on: 2015-04-23  Degree: Master  Type: Thesis
Country: China  Candidate: J Q Zhou  Full Text: PDF
GTID: 2308330452956836  Subject: Software engineering
Abstract/Summary:
When people face a large number of documents, they want to handle them by category. Clustering scientific papers helps organize them, makes navigation and retrieval easier when studying science and technology, doing research, or carrying out academic exchange, and improves efficiency by avoiding duplicated effort. This thesis therefore studies cluster analysis of scientific papers. It describes the related clustering technology, the main document clustering algorithms, the design of an improved K-Means algorithm, and the design of an improved feature selection method.

Among clustering algorithms, K-Means is simple and fast and gives good results. However, the initial cluster centers strongly affect the outcome: a poor choice can lead to a local optimum, so the clustering results may be poor and unstable. This thesis proposes using the Canopy algorithm as a preprocessing step that makes the initial cluster centers more dispersed, in order to improve the quality of the clustering results.

Given the characteristics of scientific papers, appropriate methods are selected for vectorizing the documents and measuring distance. Beyond the general process of segmenting Chinese words and removing stopwords with a basic stopword list, a feature selection method is proposed that improves segmentation, removes words that are common in papers but carry no meaning, and recalculates the weights of the title, abstract, and keywords separately, so that the feature vector represents the subject of the document more accurately.

A Java program was developed to validate the results. It consists of document processing, feature selection, and clustering parts. The document processing and feature selection parts implement both the general feature selection method and the improved method from this study; the clustering part implements both the traditional K-Means algorithm and the improved algorithm with Canopy as a preprocessing step. For testing, document data is selected, each paper's title, abstract, and keywords are placed into the test documents, and the documents are clustered after preprocessing. The test results show that the improved feature selection method and the improved clustering algorithm both work well.
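The idea of seeding K-Means with Canopy centers can be illustrated with a minimal Java sketch. It is not the thesis's program: the class name `CanopyKMeans`, the method names, the Euclidean distance, and the use of a single tight threshold `t2` (the looser Canopy threshold T1 is omitted because only the centers are needed here) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: Canopy-style seeding followed by plain K-Means.
public class CanopyKMeans {

    // Euclidean distance (an assumption; the abstract does not name the measure used).
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Simplified Canopy pass: take the next remaining point as a center, then
    // discard every remaining point closer than t2 to it. The surviving centers
    // are at least t2 apart, which spreads out the initial K-Means seeds.
    public static List<double[]> canopyCenters(double[][] points, double t2) {
        List<double[]> remaining = new ArrayList<>(Arrays.asList(points));
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center.clone());
            remaining.removeIf(p -> distance(center, p) < t2);
        }
        return centers;
    }

    // Standard K-Means: k is the number of seeds produced by the Canopy pass.
    public static int[] kMeans(double[][] points, List<double[]> seeds, int maxIter) {
        int k = seeds.size();
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = seeds.get(c).clone();
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(points[i], centers[c]) < distance(points[i], centers[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centers would not move
            }
            // Update step: move each center to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave the center of an empty cluster where it is
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy 2-D points standing in for document feature vectors.
        double[][] points = {
            {0, 0}, {0.5, 0}, {0, 0.5}, {10, 10}, {10.5, 10}, {10, 10.5}
        };
        List<double[]> seeds = canopyCenters(points, 2.0);
        int[] labels = kMeans(points, seeds, 20);
        System.out.println(Arrays.toString(labels)); // prints [0, 0, 0, 1, 1, 1]
    }
}
```

In this sketch the Canopy pass also fixes the number of clusters, since K-Means receives one seed per canopy. The threshold `t2` would need tuning for real document vectors, and a cosine-based distance may be a better fit for term-weight vectors than the Euclidean distance assumed here.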
Keywords/Search Tags: Document Clustering, K-Means Algorithm, Canopy Algorithm, Feature Selection