Research On Incremental Clustering Algorithm

Posted on: 2020-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Yao
Full Text: PDF
GTID: 2428330578964136
Subject: Computer Science and Technology
Abstract/Summary:
Cluster analysis is an important part of data mining and is widely applied in many fields. As data volumes continue to grow, obtaining information efficiently from massive data has become a central concern for clustering algorithms. Traditional clustering algorithms expect all of the data to be available before clustering, which is not possible when data keep arriving, so their results lack timeliness and they are ill-suited to current big data environments. Incremental clustering has therefore become an important research direction. When processing dynamic data, re-clustering the new data together with the original data costs a great deal of time and resources. From the perspective of processing dynamic data sets, this thesis therefore uses the existing clustering model to process incremental data without repeatedly clustering the original data. It builds on traditional clustering algorithms and existing incremental clustering algorithms so that they can process dynamic data better and faster. Targeting the poor performance of traditional clustering algorithms on dynamic data sets, it draws on the strengths of those algorithms to improve existing incremental clustering algorithms or design new ones. The main research work is as follows.

The first contribution is an incremental K-Means clustering algorithm that processes data points one by one, together with related research on selecting the initial center points. First, an incremental clustering method is designed with reference to the idea of K-Nearest Neighbor: a data sample of unknown class should be consistent with the majority of the samples among its neighbors. As new data points keep arriving, the algorithm must consider not only how to assign each new point to a known cluster but also how the incremental data affect the original clustering model. Once the new data reach a certain amount, the thesis evaluates their influence on the clustering model and uses the cluster features to decide whether clusters should be merged or split. When a new sample point does not satisfy the condition for joining any existing cluster, a new cluster is formed; if the new cluster contains far fewer sample points than the other clusters, it is treated as noise. Second, because the choice of centroids affects the initial clustering model when K-Means is used to cluster the initial data, the thesis adopts a new method for selecting initial center points that places them on the convex hull boundary of the data-dense area, yielding a better initial clustering model. The resulting algorithm processes dynamic data incrementally and updates the data model in real time from the initial clustering results, while also maintaining clustering accuracy.

The second contribution is the design of batch incremental fuzzy clustering algorithms and research on how to handle sparse high-dimensional data. The fuzzy c-means clustering algorithm is simple and iterates quickly, but it can only handle low-dimensional, small-scale data. Building on these advantages, the thesis uses blocking and sampling to extend the algorithm incrementally and proposes three incremental fuzzy clustering algorithms suitable for sparse high-dimensional big data: spHF(c+l)M, oHF(c+l)M and rseHF(c+l)M. The spHF(c+l)M and oHF(c+l)M algorithms divide the data into blocks, whereas the rseHF(c+l)M algorithm samples the data. When fuzzy c-means is run on each block or sampled data block, sample weights are first added to improve the clustering result. New objective functions that account for the interaction between centroids are then used in the iteration, which improves clustering accuracy. In addition, the algorithm normalizes the centroids in each iteration and uses the cosine distance to measure similarity, making it better suited to sparse high-dimensional data sets. With limited computer memory, the algorithms can process very large data sets block by block accurately and efficiently. The effectiveness of the three algorithms is verified on large-scale English text data.

Finally, the thesis briefly introduces the process of Chinese text clustering and the processing of text information. The new incremental K-Means clustering algorithm and the extensible incremental fuzzy clustering algorithms are applied to Chinese text analysis. The experimental results show that the improved incremental clustering algorithms work well on dynamic Chinese text data sets.
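The one-by-one incremental step described in the abstract (majority vote among nearest neighbors, a new cluster when no existing cluster fits, very small clusters treated as noise) can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the function names, the parameters `k`, `radius` and `min_fraction`, and the specific joining condition (nearest neighbor within `radius`) are all assumptions made for illustration, and the merge/split step based on cluster features is omitted because the abstract does not specify it.

```python
import math
from collections import Counter


def knn_incremental_assign(points, labels, new_point, k=3, radius=1.5):
    """Label new_point from its k nearest labeled points (hypothetical rule).

    If even the nearest labeled point is farther than `radius`, the point
    opens a new cluster. `k` and `radius` are illustrative assumptions.
    The point and its label are appended to `points` and `labels` in place.
    """
    # Sort existing points by Euclidean distance to the new point.
    dists = sorted(
        (math.dist(p, new_point), lab) for p, lab in zip(points, labels)
    )
    nearest = dists[:k]
    if not nearest or nearest[0][0] > radius:
        # No existing cluster is close enough: start a new one.
        new_label = max(labels, default=-1) + 1
    else:
        # Majority vote among the neighbors that lie within the radius.
        votes = Counter(lab for d, lab in nearest if d <= radius)
        new_label = votes.most_common(1)[0][0]
    points.append(new_point)
    labels.append(new_label)
    return new_label


def noise_clusters(labels, min_fraction=0.1):
    """Return labels of clusters far smaller than the largest cluster.

    `min_fraction` is an assumed cutoff for "much smaller".
    """
    sizes = Counter(labels)
    largest = max(sizes.values())
    return {lab for lab, n in sizes.items() if n < min_fraction * largest}


# Two well-separated initial clusters, then three arriving points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
for p in [(0.5, 0.5), (10.5, 10.5), (50, 50)]:
    knn_incremental_assign(points, labels, p)
print(labels)
print(noise_clusters(labels, min_fraction=0.5))
```

In this sketch the original points are never re-clustered; each arriving point only consults the labels already assigned, which is the property the abstract emphasizes.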
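The per-block fuzzy clustering step (sample weights, centroids normalized in each iteration, cosine distance as the dissimilarity) can likewise be sketched in Python. The sketch below is a generic weighted fuzzy c-means iteration on unit-length vectors, written under stated assumptions: it uses the standard membership update with the cosine distance in place of the squared Euclidean distance, and it does not include the thesis's new objective functions that model the interaction between centroids, which the abstract does not define. The names `weighted_fcm_step`, `m` and `eps` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_distance(a, b):
    """1 - cosine similarity, assuming a and b already have unit length."""
    return max(0.0, 1.0 - sum(x * y for x, y in zip(a, b)))


def weighted_fcm_step(data, weights, centroids, m=2.0, eps=1e-12):
    """One fuzzy c-means iteration with sample weights and cosine distance.

    Returns (memberships, new_centroids). `m` is the usual fuzzifier and
    `eps` guards against division by zero; both are illustrative choices.
    """
    data = [normalize(x) for x in data]
    centroids = [normalize(v) for v in centroids]
    # The cosine distance plays the role of the squared distance in the
    # standard objective, so the exponent is 1 / (m - 1).
    exponent = 1.0 / (m - 1.0)
    c = len(centroids)

    memberships = []
    for x in data:
        d = [cosine_distance(x, v) + eps for v in centroids]
        row = [
            1.0 / sum((d[i] / d[k]) ** exponent for k in range(c))
            for i in range(c)
        ]
        memberships.append(row)

    new_centroids = []
    for i in range(c):
        acc = [0.0] * len(data[0])
        for x, w, u in zip(data, weights, memberships):
            coeff = w * u[i] ** m  # the sample weight scales each contribution
            for j, xj in enumerate(x):
                acc[j] += coeff * xj
        # Re-normalize so the centroid stays on the unit sphere.
        new_centroids.append(normalize(acc))
    return memberships, new_centroids


data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
weights = [1.0, 1.0, 1.0, 1.0]
u, v = weighted_fcm_step(data, weights, [[1.0, 0.2], [0.2, 1.0]])
print(u)
print(v)
```

In a block-wise scheme, the centroids returned for one block, together with weights summarizing that block, would typically be carried into the next block so that earlier data need not be kept in memory; how the thesis's three algorithms do this is not detailed in the abstract.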
Keywords/Search Tags: incremental clustering, dynamic datasets, K-nearest neighbor, extensible clustering