Research For Clustering Algorithm For Keywords Based On Coloring Spread

Posted on:2016-09-13

Degree:Master

Type:Thesis

Country:China

Candidate:Y He

Full Text:PDF

GTID:2308330479993912

Subject:Computer system architecture

Abstract/Summary:

With the advent of the big data era and the development of Internet technology, there is an urgent need to extract useful knowledge from big data, which is why data analysis and data mining play an irreplaceable role. In recent years, with the rapid development of low-cost, large-scale cluster technologies such as MPI, MapReduce, and Spark, many data mining and machine learning algorithms have been widely applied to large-scale data sets and have become an indispensable part of people's daily life and work. Clustering analysis is one of the most important methods of data mining, and one of the most important research topics in data mining, machine learning, and pattern recognition. It is an unsupervised machine learning technique; in other words, it needs no training set or guidance information to recognize the relationships between objects.
This paper proposes a clustering algorithm for keywords based on coloring spread, named Clustering Based on Coloring Spread from Clustering Coefficient Maxima (CBCSCCM), and runs experiments separately on keywords selected from 20 groups of Chinese news and English news. First, the articles are segmented into words (including the removal of stop words). Second, word2vec is trained on these words to obtain word representations, the keywords of each category are taken as the words whose TF-IDF values rank in the top k, and the cosine value between each pair of keywords is computed. These keywords and their cosine values are then used to build a graph. Finally, the proposed algorithm is applied to cluster the vertices of this graph. The algorithm is similar to density-based clustering: the higher the density between objects, the more closely they gather together. It does not require the value of K (the number of clusters) to be provided, and it can be implemented in a distributed manner on a graph-parallel computing model in order to handle vast amounts of data. In this paper, the distributed implementation uses ODPS-graph.
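The graph-construction step described above can be illustrated with a minimal sketch. It assumes keyword vectors are already available (the toy three-dimensional vectors stand in for word2vec output), connects keywords whose cosine value is not less than a threshold, and computes the standard local clustering coefficient of each vertex. The function names, the toy data, and the threshold value are hypothetical, and the coloring-spread step itself is not reproduced here because the abstract does not specify it.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, threshold):
    """Return {word: {neighbor: weight}}, keeping edges with cosine >= threshold."""
    graph = {word: {} for word in vectors}
    for a, b in combinations(vectors, 2):
        weight = cosine(vectors[a], vectors[b])
        if weight >= threshold:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def local_clustering_coefficient(graph, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = list(graph[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for x, y in combinations(neighbors, 2) if y in graph[x])
    return 2.0 * links / (k * (k - 1))


# Toy vectors (hypothetical stand-ins for word2vec representations).
vectors = {
    "football": [1.0, 0.1, 0.0],
    "soccer": [0.9, 0.2, 0.0],
    "goal": [0.8, 0.3, 0.1],
    "stock": [0.0, 0.1, 1.0],
    "market": [0.1, 0.0, 0.9],
}

graph = build_graph(vectors, threshold=0.8)
coefficients = {w: local_clustering_coefficient(graph, w) for w in graph}
for word, value in coefficients.items():
    print(word, sorted(graph[word]), value)
```

In this toy graph the three sports keywords form a triangle and the two finance keywords share a single edge, so vertices with a high clustering coefficient sit inside dense groups. Under the assumption suggested by the algorithm's name, such maxima would be natural starting points for spreading colors.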
At the same time, to reduce the impact of noisy data and to improve the clustering effect, we cluster layer by layer, each time retaining only the edges whose weights are not less than a given threshold. In addition, we run clustering experiments on the keywords selected from the news articles, evaluate the clustering effect with Purity, RI, ARI, NMI, and F-measure, and compare the results with those of K-means, Normalized Spectral Clustering, and CBFSAFODP. The results show that our algorithm achieves a better clustering effect.
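Two of the evaluation measures named above, Purity and the Rand index (RI), are simple enough to sketch directly. The example below uses their standard textbook definitions; the labels are made-up toy data, not results from the thesis, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Share of items that belong to the majority true class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, []).append(truth)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Share of item pairs on which the two labelings agree (same vs. different)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy labels (hypothetical): five keywords, two true categories, two clusters.
truth = ["sports", "sports", "sports", "finance", "finance"]
pred = [0, 0, 1, 1, 1]

print("purity:", purity(truth, pred))
print("rand index:", rand_index(truth, pred))
```

Both measures range from 0 to 1, with higher values meaning closer agreement with the reference categories. ARI additionally corrects RI for chance agreement, which matters when comparing algorithms that produce different numbers of clusters.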

Keywords/Search Tags:

data mining, keywords clustering, clustering coefficient, coloring spread, word2vec, TF-IDF, word representation, distributed, graph parallel computing, ODPS-graph

Related items

1	Research And Implementation Of Mapreduce-based Graph Clustering Algorithm
2	Research On Clustering Algorithm Based On Graph Coloring Theory
3	Distributed Clustering Of Graph Data And Application To Data Mining For E-commerce
4	Study On Graph Sampling Algorithm For Graph Clustering Characteristic
5	Parallel algorithms for large-scale graph clustering on distributed memory architectures
6	Research On Clustering Algorithm For Clusters With Irregular Structure
7	Research And Application Of Bidding Data Mining Based On Graph Clustering
8	Design And Implementation Of Parallel Peer Pressure Map Clustering Algorithm Based On Linear Algebra
9	Application Research On Graph Mining Based On Structure And Attribute
10	Research And Implementation Of Graph Mining Platform Based On Pregel-like Framework