Clustering Algorithm Of Cluster Center Self-confirmation And Its Application In Document Clustering

Posted on: 2021-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: S F Chen
Full Text: PDF
GTID: 2428330629980444
Subject: Computer Science and Technology
Abstract/Summary:
In daily life, a great deal of information is stored and disseminated as text, for example in blogs, Weibo posts, and web pages. With the advent of the big data era, the amount of text data on the Internet is growing exponentially. Most of this text is stored in unstructured form, yet it contains a great deal of potentially useful information. Extracting that information quickly and efficiently requires data mining techniques. Clustering discovers latent groups in a data set according to some rule, such that the data objects within each group are highly similar to one another. Clustering is therefore an important data mining technique and an unsupervised way of obtaining latent information from data sets. It is currently used in many fields, including anomaly detection, artificial intelligence, and computer vision. Text clustering, an important branch of cluster analysis, can draw on a variety of clustering methods. K-medoids, a heuristic partitioning method, has been used in many practical applications with good results because it is simple to implement and is less affected by outliers. However, the traditional K-medoids algorithm has limitations; for example, the random selection of the initial cluster centers affects the clustering result. The main contributions of this thesis are:

(1) To address three defects of clustering by fast search and find of density peaks (DPC), namely high computational complexity, dependence on the cutoff distance d_c, and the need to choose cluster centers manually, a clustering algorithm with self-confirming cluster centers based on residuals and density grids is proposed. The algorithm first replaces data objects with grid objects, then calculates the distance value and density value of each grid object, and finally uses residual analysis to determine the cluster centers automatically. Experimental results on artificial data sets and real UCI data sets show that the algorithm selects the initial cluster centers and determines the number of cluster centers well, and that its clustering results are better than those of the DPC algorithm.

(2) To address the problem that the clustering result of the K-medoids algorithm varies with the value of K and the initial cluster centers, an improved K-medoids algorithm based on density-weighted Canopy is proposed. The algorithm first calculates the density of each sample object and selects the one with the highest density as the first cluster center. It then removes all sample objects belonging to that cluster center. Finally, the next cluster center is selected according to the weight of each remaining sample object, and this is repeated until the data set is empty. Experimental results on real UCI data sets and artificial data sets show that the algorithm determines the number of clusters well, selects reasonable initial cluster centers, and improves the accuracy and stability of the clustering.

(3) To address the problem that traditional text clustering ignores the semantic relationships among feature words and suffers from high-dimensional data, a text clustering method combining the DWC_K-medoids algorithm with frequent itemsets is proposed. The method first applies feature selection to filter out redundant feature items and then extracts frequent itemsets from the remaining features. Next, a text representation model is constructed from the frequent itemsets, and the Euclidean distance is used to measure similarity. Finally, the DWC_K-medoids algorithm performs the clustering, and the resulting clusters are described. Experimental results show that the method achieves a better clustering effect on text data.
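
The density-weighted Canopy initialization summarized in the abstract (choose the densest sample, remove the samples that belong to it, then choose the next center by weight until no samples remain) can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the following are assumptions made only for illustration: density is the number of samples within a fixed `radius`, the weight of a sample is its density multiplied by its distance to the nearest center chosen so far, and a sample belongs to a center when it lies within `radius` of it. The function name `density_weighted_canopy` and the parameter `radius` are hypothetical and are not taken from the thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_weighted_canopy(points, radius):
    """Return indices of initial cluster centers (hypothetical helper).

    Assumptions not taken from the abstract:
      * density(i) = number of samples within `radius` of sample i
      * weight(i)  = density(i) * distance from i to its nearest chosen center
      * a sample "belongs" to a center when it lies within `radius` of it
    """
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    density = [sum(1 for j in range(n) if dist[i][j] <= radius) for i in range(n)]

    remaining = set(range(n))
    centers = []
    while remaining:
        candidates = sorted(remaining)  # fixed order makes ties deterministic
        if not centers:
            # First center: the sample with the highest density.
            center = max(candidates, key=lambda i: density[i])
        else:
            # Later centers: the remaining sample with the largest weight.
            center = max(
                candidates,
                key=lambda i: density[i] * min(dist[i][c] for c in centers),
            )
        centers.append(center)
        # Remove every sample that belongs to the new center (including itself).
        remaining -= {j for j in remaining if dist[center][j] <= radius}
    return centers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
centers = density_weighted_canopy(points, radius=2.0)
print(centers)
```

With these seven points and a radius of 2.0, the sketch selects one center in each of the two groups, so the number of clusters follows from the data instead of being fixed in advance. The selected indices could then be used as the initial medoids of a standard K-medoids run.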
Keywords/Search Tags: Text clustering, Canopy algorithm, K-medoids algorithm, DWC_K-medoids algorithm