Research on a Clustering Algorithm for Keywords Based on Coloring Spread

Posted on: 2016-09-13
Degree: Master
Type: Thesis
Country: China
Candidate: Y He
GTID: 2308330479993912
Subject: Computer system architecture
Abstract/Summary:
With the advent of the big data era and the development of Internet technology, there is a pressing need to extract useful knowledge from large volumes of data, which is why data analysis and data mining play an irreplaceable role. In recent years, with the rapid development of low-cost, large-scale cluster technologies such as MPI, MapReduce, and Spark, many data mining and machine learning algorithms have been widely applied to large-scale data sets and have become an indispensable part of people's daily life and work. Clustering analysis is one of the most important methods of data mining, and it is also one of the most important research topics in data mining, machine learning, and pattern recognition. It is an unsupervised machine learning technique; in other words, it does not need a training set or guiding information to recognize the relationships between objects.
This paper proposes a clustering algorithm for keywords based on coloring spread, named Clustering Based on Coloring Spread from Clustering Coefficient Maxima (CBCSCCM), and experiments with keywords selected from 20 groups of Chinese news and English news, respectively. First, the articles are segmented into words (including the removal of stop words). Second, word2vec is trained on these words to obtain word representations, the keywords of each category are taken to be the words whose TF-IDF values rank in the top k, and the cosine values between them are computed. Then these keywords and their cosine values are used to create a graph. Finally, our algorithm is applied to cluster the vertices of the graph. The algorithm is similar to density-based clustering algorithms: the higher the density between objects, the more closely they gather together. It does not require the value of K (the number of clusters), and it can be implemented in a distributed manner on a graph-parallel computing model in order to handle vast amounts of data. In this paper, we use ODPS-graph for the distributed implementation.
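The graph-construction step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy two-dimensional vectors stand in for word2vec representations, and the function names, the `min_sim` cutoff, and the use of the standard unweighted local clustering coefficient are all assumptions made for illustration. The coloring-spread step itself is not shown, since the abstract does not specify it.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(vectors, min_sim=0.0):
    """Return {(a, b): weight} for keyword pairs with a < b.

    `min_sim` is a hypothetical cutoff below which no edge is created.
    """
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        weight = cosine(vectors[a], vectors[b])
        if weight >= min_sim:
            edges[(a, b)] = weight
    return edges


def local_clustering_coefficient(edges, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = set()
    for a, b in edges:
        if a == node:
            neighbours.add(b)
        elif b == node:
            neighbours.add(a)
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for pair in combinations(sorted(neighbours), 2) if pair in edges)
    return 2 * links / (k * (k - 1))


# Toy stand-ins for word2vec vectors of four keywords (assumed data).
vectors = {
    "apple": [1.0, 0.0],
    "pear": [0.9, 0.1],
    "fruit": [0.8, 0.2],
    "cpu": [0.0, 1.0],
}
edges = build_graph(vectors, min_sim=0.5)
```

In this toy graph the three fruit-related keywords form a triangle, so each has a high local clustering coefficient, while "cpu" is isolated. Vertices of this kind are plausible starting points for an algorithm that spreads from clustering-coefficient maxima, though how CBCSCCM actually selects and spreads from them is not stated in the abstract.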
At the same time, to reduce the impact of noisy data and to improve the clustering result, we cluster layer by layer, retaining at each layer only the edges whose weights are not less than a given threshold. In addition, we run clustering experiments on the keywords selected from the news articles and use Purity, RI, ARI, NMI, and F-measure to evaluate the clustering results, comparing them with those of K-means, Normalized Spectral Clustering, and CBFSAFODP. The results show that our algorithm achieves better results.
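Two of the ideas in this paragraph can be sketched briefly: keeping only edges whose weights are not less than a threshold, and scoring a clustering with Purity. This is a minimal sketch under stated assumptions: the edge dictionary format, the function names, and the example labels are hypothetical, and the abstract does not say how the threshold changes from layer to layer.

```python
from collections import Counter


def keep_edges(edges, threshold):
    """Keep only edges whose weight is not less than the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


def purity(predicted, truth):
    """Purity: each cluster votes for its majority class; return the
    fraction of items that match their cluster's majority class."""
    clusters = {}
    for cluster_id, label in zip(predicted, truth):
        clusters.setdefault(cluster_id, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(truth)


# Assumed example: two predicted clusters over five labelled keywords.
predicted = [0, 0, 1, 1, 1]
truth = ["sports", "sports", "sports", "tech", "tech"]
score = purity(predicted, truth)  # (2 + 2) / 5
```

Purity rewards clusters dominated by a single class but is trivially maximized by putting every item in its own cluster, which is one reason to report it alongside measures such as ARI and NMI.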
Keywords/Search Tags:data mining, keywords clustering, clustering coefficient, coloring spread, word2vec, TF-IDF, word representation, distributed, graph parallel computing, ODPS-graph