Research On Clustering Algorithm Of K-medoids And Its Application In Text Clustering

Posted on: 2018-12-09  Degree: Master  Type: Thesis
Country: China  Candidate: L F Wang  Full Text: PDF
GTID: 2348330518463018  Subject: Engineering
Abstract/Summary:
Text clustering partitions a given set of texts into multiple clusters so that documents in different clusters are as dissimilar as possible while documents in the same cluster are as similar as possible. As an unsupervised machine learning method, clustering needs no training process and no manual labeling of categories in advance, which gives it a degree of automation and high flexibility. It has become an important means of summarizing, navigating, and organizing textual information and has attracted wide attention.

Text clustering mainly represents documents in a vector space model based on TF-IDF statistics. The process involves text preprocessing, Chinese word segmentation, feature extraction and weight calculation, the clustering algorithm, and clustering performance evaluation. In text clustering based on the vector space model, the calculation of feature weights and the choice of clustering algorithm are two significant factors that affect the clustering result.

The traditional feature weight calculation considers only term frequency and inverse document frequency and ignores the influence of a document's category on feature weight. In addition, practical applications may lack a standard labeled classification dataset. To address both problems, this paper proposes a new method for calculating feature weights that combines category information and semantic contribution. First, it combines semantic contribution with the fuzzy clustering proposed in this paper to perform a rough clustering, which converts a text collection without category information into one with category information. Then the traditional TF-IDF weight calculation is improved by combining class information entropy and semantic contribution degree, yielding a better feature weight calculation. Evaluated on the Chinese text classification corpus of the Fudan University Chinese Natural Language Processing open platform, the new method outperforms the traditional weight calculation method.

The K-medoids algorithm is sensitive to its initial centers, and a poor choice of initial centers may trap the clustering in a local optimum. To address this defect, this paper proposes a new radius-adaptive algorithm for selecting initial centers. In each iteration, the algorithm computes the radius from the distribution of the remaining samples, dynamically calculating the local variance and neighborhood radius of each sample, and then selects the optimal initial cluster centers to achieve a better clustering result. Tests on UCI datasets of different sizes and on simulated datasets with different proportions of random points, evaluated with five general clustering indices, show that this algorithm performs better than other similar algorithms.

Finally, the improved text clustering algorithm is implemented as a text clustering system that shows the entire process, and the experimental results are compared within the system.
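
For orientation, the baseline pipeline that the abstract sets out to improve (traditional TF-IDF weighting followed by K-medoids) can be sketched as below. This is a minimal illustration under stated assumptions, not the method proposed in the thesis: the function names `tfidf_vectors`, `cosine_distance`, and `k_medoids` are hypothetical, documents are assumed to be already segmented into non-empty token lists, cosine distance is an assumed choice of dissimilarity, and the initial medoids are passed in explicitly because the radius-adaptive selection procedure is not specified in the abstract.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors from non-empty token lists (baseline weighting)."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[token] / len(doc) * idf[token] for token in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def cosine_distance(a, b):
    """Cosine distance between two vectors that are already L2-normalised."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def assign(vectors, medoids):
    """Label each vector with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: cosine_distance(v, vectors[medoids[c]]))
        for v in vectors
    ]


def k_medoids(vectors, initial_medoids, max_iter=100):
    """Alternating K-medoids: assign points, then re-pick each cluster's medoid."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(vectors, medoids)
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the old medoid for an empty cluster
                continue
            # The medoid is the member with the smallest total distance to the others.
            best = min(
                members,
                key=lambda i: sum(cosine_distance(vectors[i], vectors[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(vectors, medoids)
```

Because the result depends on `initial_medoids`, running the sketch with different starting indices on the same data is a simple way to observe the sensitivity to initial centers that the abstract describes.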
Keywords/Search Tags: K-medoids, text clustering, feature weight, information entropy, initial clustering center