Research On Clustering Algorithm Of K-medoids And Its Application In Text Clustering

Posted on:2018-12-09

Degree:Master

Type:Thesis

Country:China

Candidate:L F Wang

Full Text:PDF

GTID:2348330518463018

Subject:Engineering

Abstract/Summary:

Text clustering partitions a given set of texts into multiple clusters so that documents in different clusters are as dissimilar as possible while documents in the same cluster are as similar as possible. As an unsupervised machine learning method, clustering needs neither a training process nor manually labeled categories, which gives it a degree of automation and high flexibility. It has become an important means of summarizing, navigating, and organizing textual information and has attracted wide attention. Text clustering mainly represents documents in a vector space built from TF-IDF statistics, and the process involves text preprocessing, Chinese word segmentation, feature extraction and weight calculation, the clustering algorithm, and clustering performance evaluation. In text clustering based on the vector space model, the calculation of feature weights and the choice of clustering algorithm are two key factors that determine the clustering quality.

The traditional feature weight calculation considers only feature frequency and inverse document frequency and ignores the effect of a document's category on feature weight; moreover, a standard labeled dataset may not be available in practical applications. To address this, this thesis proposes a new feature weight calculation method that combines category information and semantic contribution. First, the semantic contribution and the fuzzy clustering proposed in this thesis are combined to convert a text collection without category information into one with category information through rough clustering. Then the traditional TF-IDF weight calculation is improved by incorporating class information entropy and semantic contribution degree. Evaluated on the Chinese text classification corpus from the Chinese Natural Language Processing open platform of Fudan University, the new method outperforms the traditional weight calculation method.

The K-medoids algorithm is sensitive to the initial centers, and a poor choice of initial center points may trap the clustering in a local optimum. To address this defect, this thesis proposes a new radius-adaptive algorithm for selecting the initial centers. In each iteration the algorithm computes the radius from the distribution of the remaining samples, dynamically calculating the local variance and neighborhood radius of each sample, and then selects the best initial cluster centers to achieve a better clustering result. Tests on UCI datasets of different sizes and on simulated datasets with different ratios of random points, evaluated with five general clustering indexes, show that the algorithm performs better than other similar algorithms.

Finally, the improved text clustering algorithm is implemented as a text clustering system that shows the entire process, and the experimental results are compared within the system.
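
As a rough illustration of why the initial centers matter, the following is a minimal sketch of a plain alternating K-medoids loop in Python. It is not the radius-adaptive initialization proposed in the thesis, whose details the abstract does not give; the function names (`euclidean`, `assign`, `k_medoids`), the toy `points` data, and the choice of Euclidean distance are all assumptions made for this example. The initial medoid indices are passed in explicitly, since that is the input an initialization strategy would supply.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(dist: List[List[float]], medoids: List[int]) -> List[int]:
    """Label each sample with the position (0..k-1) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(
    points: Sequence[Sequence[float]],
    init_medoids: Sequence[int],
    max_iter: int = 100,
) -> Tuple[List[int], List[int], float]:
    """Plain alternating K-medoids (hypothetical helper, not the thesis's method).

    init_medoids holds indices into points; the result can depend on them.
    Returns (medoid indices, cluster labels, total distance to medoids).
    """
    n = len(points)
    # Precompute the full pairwise distance matrix once.
    dist = [[euclidean(p, q) for q in points] for p in points]
    medoids = list(init_medoids)

    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Can happen with duplicate points; keep the old medoid.
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total
            # distance to all other members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids

    labels = assign(dist, medoids)
    cost = sum(dist[i][medoids[labels[i]]] for i in range(n))
    return medoids, labels, cost


# Toy data (assumed for illustration): two well-separated groups in 2-D.
points = [(0, 0), (1, 0), (2, 0), (1, 1),
          (10, 10), (11, 10), (12, 10), (11, 11)]
medoids, labels, cost = k_medoids(points, init_medoids=[0, 4])
```

On data this cleanly separated the loop recovers the two groups, but in general the alternating update only improves the current partition, so different `init_medoids` can end in different local optima. That dependence is the weakness an initial-center selection step is meant to reduce.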

Keywords/Search Tags:

K-medoids, text clustering, feature weight, information entropy, initial clustering center

Related items

1	Precise Clustering Algorithm For Chinese Text Based On K-means
2	Research On K-Medoids Algorithm And External Clustering Evaluation Index Of Variance Optimization Initial Cluster Center
3	Clustering Algorithm Of Cluster Center Self-confirmation And Its Application In Document Clustering
4	Research And Application Of K-medoids Clustering Algorithm Based On ?_o-neighborhood Search Strategy
5	Research On Patent Text Clustering Based On Improved K-means Algorithm
6	Research On Text Clustering Based On Division And Hierarchy
7	Research On Non-IID K-Medoids Clustering Algorithm
8	Research And Implementation Of KFCM Algorithm Based On Bat Algorithm
9	Research On Text Clustering And Its Application In Topic Detection Analysis
10	KNN Text Classification Algorithm Based On The Semantics Of The Center