The Analysis Of K-means Cluster Algorithm For Website Content

Posted on:2012-09-15

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Shi

Full Text:PDF

GTID:2268330425997283

Subject:Computer software and theory

Abstract/Summary:

In recent years, with the rapid development of the Internet and the rising level of informatization in daily life, text data resources have grown at a surprising speed, which makes it difficult to obtain the target information and lowers the utilization rate of information. High-dimensional data is increasingly becoming mainstream, so in practical applications of clustering, research on high-dimensional clustering methods plays an increasingly important role. However, some unique characteristics of high-dimensional data make mining it very difficult, so special methods are needed to process it. The clustering object studied in this paper is website content, which is a typical case of high-dimensional clustering. Starting from the concept of clustering and the characteristics of high-dimensional data, we focus on three factors and study and improve each: the similarity measure in high-dimensional space, the weight representation, and "noise" reduction. Because the data in this paper is high-dimensional, similarity measures suited to low-dimensional data often fail, so we use a similarity measure called Nsim(), which is suited to high-dimensional, sparse text data. Experiments show that this similarity measure remains stable and retains good discriminating ability in high-dimensional space. Feature weights play a decisive role in forming the vector space model, so they have a large effect on the clustering results.
In an HTML file, tag information plays a more important role in determining a web page's category than general feature attributes; therefore, this paper proposes an improved TF-IDF weighting method suited to web pages. K-means is a typical fast partition-based clustering algorithm. In traditional K-means, the k initial cluster centers are chosen at random, which often makes the clustering results unstable and gives no guarantee of clustering quality. This paper proposes an improved way of determining the centers, computing them with the maximum and minimum rules. To reduce the negative effect that "noise" attributes of website content have on clustering, this paper also incorporates a space model correction method into the clustering process: using a comprehensive feature attribute measure, we judge the importance of the feature attributes and revise the feature space accordingly, achieving "noise reduction". Experiments show that, on website data, the improved algorithm has a clear advantage in stability over traditional K-means, and the clustering quality is also improved.
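
The abstract does not spell out the maximum and minimum rules, so the following is only a minimal sketch of one common reading: a max-min (farthest-first) choice of initial centers, where each new center is the point whose distance to its nearest already-chosen center is largest. The function name `max_min_centers`, the use of the first point as the starting center, and the Euclidean distance are all assumptions made for illustration; the thesis itself uses the Nsim() measure, which is not defined here.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers with a max-min (farthest-first) rule.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: the first point seeds the procedure. The abstract does
    # not say how the first center is chosen.
    centers = [points[0]]

    # min_dist[i] holds the distance from point i to its nearest chosen center.
    # Assumption: Euclidean distance stands in for the thesis's Nsim() measure.
    min_dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # Max-min step: take the point that is farthest from its nearest center.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        # Refresh nearest-center distances against the newly added center.
        min_dist = [
            min(d, math.dist(p, points[idx])) for d, p in zip(min_dist, points)
        ]
    return centers


# Toy 2-D data standing in for document vectors.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 9.0)]
centers = max_min_centers(points, 3)
print(centers)
```

Because the choice is deterministic once the first center is fixed, repeated runs give the same starting centers, which is consistent with the stability advantage over random initialization that the abstract reports.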

Keywords/Search Tags:

high-dimensional sparse, similarity measurement, weighting, K-means, feature selection, maximum and minimum rules

Related items

1	Based On Structural Similarity And Sparse Representation Research On FR_IQA Algorithm
2	Research On RBF Neural Network Algorithm And Its Application In High Dimensional Data Preprocessing
3	Research On Locality Sensitive Hashing-Based Similarity Search
4	The Research On Dynamic And Abstract Clustering Method Of High Dimensional Sparse Data
5	The Limitations Of Collaborative Filtering Algorithm And Its Improvement
6	Research On Feature Selection Algorithm Of High-dimensional Data Based On Intelligent Optimization
7	Improvement Of K-Means Algorithm And Its Application In Weibo Topic Discovery
8	High-dimensional Data-oriented Clustering Algorithm Design And Tensor Low-rank Representation Research
9	Inter-industry Sparse Portfolio Research Based On Machine Learning In High-dimensional Framework
10	Research On One-class Classifier Based On Geometric Covering Model Of Target Class In High-dimensional Space