Research On Document Clustering Algorithm Based On K-means

Posted on:2015-04-23

Degree:Master

Type:Thesis

Country:China

Candidate:J Q Zhou

GTID:2308330452956836

Subject:Software engineering

Abstract/Summary:

When faced with a large number of documents, people want to handle them by category. Clustering scientific papers helps to organize them, makes navigation and retrieval more convenient for studying science and technology, doing research, and carrying out academic exchange, and improves efficiency by avoiding duplicated effort. This thesis therefore studies clustering analysis of scientific papers. It describes the related clustering technology, the main document clustering algorithms, the design of an improved K-Means algorithm, and the design of an improved feature selection method.

Among the various clustering algorithms, K-Means is simple and fast and gives good results. However, the choice of initial cluster centers strongly affects its output and may lead to a local optimum, so the clustering results can be poor and unstable. To improve the quality of the clustering results, this study proposes using the Canopy algorithm as a preprocessing step, which makes the distribution of the initial cluster centers more dispersed.

Given the characteristics of scientific papers, appropriate methods are selected to vectorize the documents and measure distance. Beyond the general process of segmenting Chinese words and removing the stopwords on a basic stopword list, a feature selection method is proposed that segments words more accurately, removes meaningless words common in papers, and recalculates the weight values of the title, abstract, and keywords separately, so that the feature vector represents the subject of the document more accurately.

A Java program was developed to validate the results. It consists of document processing, feature selection, and clustering algorithm parts. The document processing and feature selection parts implement both the general feature selection method and the improved method from this study; the clustering part implements both the traditional K-Means algorithm and the improved algorithm with Canopy as a preprocessing step. For testing, the title, abstract, and keywords of each selected paper are placed into the test documents, which are clustered after preprocessing. The test results show that the improved feature selection method and the improved clustering algorithm both perform well.
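
The idea of seeding K-Means with Canopy centers can be illustrated with a minimal Java sketch. It is not the thesis's implementation, which is not shown in the abstract. The class name `CanopyKMeans`, the method names, the single tight threshold `t2`, the Euclidean distance, and the toy two-dimensional points are all assumptions made for illustration; a full Canopy pass normally also uses a looser threshold T1, and document vectors would typically be high-dimensional term weights.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CanopyKMeans {

    // Euclidean distance (an assumption; the abstract does not name the measure used).
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Simplified Canopy pass: take the next remaining point as a canopy center,
    // then discard every remaining point closer than t2 to that center.
    // The surviving centers are at least t2 apart, so they are well dispersed.
    static double[][] canopyCenters(double[][] points, double t2) {
        List<double[]> remaining = new ArrayList<>(Arrays.asList(points));
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center.clone()); // copy so later updates never touch the input
            remaining.removeIf(p -> distance(p, center) < t2);
        }
        return centers.toArray(new double[0][]);
    }

    // Standard K-Means starting from the given centers (updated in place).
    // Returns the cluster index assigned to each point.
    static int[] kMeans(double[][] points, double[][] centers, int maxIterations) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);
        int dim = points[0].length;
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (distance(points[i], centers[c]) < distance(points[i], centers[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each center to the mean of its members.
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (int i = 0; i < points.length; i++) {
                counts[labels[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // leave the center of an empty cluster where it is
                }
                for (int d = 0; d < dim; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups of three points each.
        double[][] points = {
            {1.0, 1.0}, {1.2, 0.8}, {0.9, 1.1},
            {8.0, 8.0}, {8.2, 7.9}, {7.8, 8.1}
        };
        double[][] seeds = canopyCenters(points, 2.0);
        int[] labels = kMeans(points, seeds, 100);
        System.out.println(seeds.length + " clusters, labels = " + Arrays.toString(labels));
    }
}
```

In this sketch the Canopy pass also determines the number of clusters, so the threshold `t2` takes the place of a hand-picked K. Whether the thesis uses Canopy to fix K or only to choose the starting centers is not stated in the abstract.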

Keywords/Search Tags:

Document Clustering, K-Means Algorithm, Canopy Algorithm, Feature Selection

Related items

1	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
2	Research On Hot Topics Discovery In Microblog Based On Distributed K-means Algorithms
3	Research And Application Of Multilayer Feature Selection Algorithm Based On Clustering
4	The Research And Application Of Clustering Feature Selection Methods
5	Clustering Algorithm Of Cluster Center Self-confirmation And Its Application In Document Clustering
6	Research Of K-means Clustering Algorithm Based On MapReduce
7	Research On Problems Related To The Initial Center Selection In K-means Clustering Algorithm
8	Study On Text Fuzzy Clustering Method Based On The Improved Feature Selection With TFIDF-GA
9	Based On K-means The Chinese Text Clustering Algorithm
10	Research On Distributed Clustering Algorithm Based On MapReduce