
The Analysis Of K-means Cluster Algorithm For Website Content

Posted on: 2012-09-15 | Degree: Master | Type: Thesis
Country: China | Candidate: Y J Shi
GTID: 2268330425997283 | Subject: Computer software and theory
Abstract/Summary:
In recent years, with the rapid development of the Internet and the rising level of informatization in daily life, text data resources have grown at a surprising speed. This makes it difficult to obtain the target information and lowers the utilization rate of the information. High-dimensional data is increasingly becoming the mainstream, so in practical applications of clustering, research on clustering methods for high-dimensional data plays an increasingly important role. However, some unique characteristics of high-dimensional data make mining it very difficult, so special methods are needed to process it. The clustering object studied in this paper is website content, which is a typical case of high-dimensional clustering. Starting from the concept of clustering and the characteristics of high-dimensional data, we focus on the impact of three feature-attribute factors and study and improve each of them: the similarity measure in high-dimensional space, weight representation, and "noise" reduction. Because the data in this paper is high-dimensional, similarity measures suited to low-dimensional data often fail, so this paper uses a similarity measure called Nsim(), which is suited to high-dimensional, sparse text data. Experiments show that this similarity measure retains good stability and discriminating ability in high-dimensional space. Feature weights play a decisive role in forming the vector space model, so they have a large effect on the clustering results.
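To make the vector space model concrete, the sketch below builds sparse TF-IDF vectors for tokenized documents and compares them. It is only an illustration under stated assumptions: the weighting is plain textbook TF-IDF, and cosine similarity is used as a common stand-in for sparse text vectors. It is not the Nsim() measure, whose definition the abstract does not give. The function names `tfidf_vectors` and `cosine` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dict: term -> weight) for tokenized docs."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts.

    Assumption: a stand-in measure for illustration, not Nsim().
    """
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): two sports pages and one finance page.
docs = [
    ["football", "match", "goal"],
    ["football", "league", "goal"],
    ["stock", "market", "price"],
]
vecs = tfidf_vectors(docs)
```

Storing each vector as a dictionary keeps only the non-zero terms, which matters for high-dimensional sparse data: the cost of a comparison depends on the number of terms a page actually contains, not on the size of the vocabulary.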
HTML tag information plays a more important role in determining a web page's category than ordinary feature attributes; this paper therefore proposes an improved TF-IDF weighting method suited to web pages. K-means is a typical fast partition-based clustering algorithm. In traditional K-means the k initial cluster centers are chosen at random, which often makes the clustering results unstable and gives no guarantee of clustering quality. This paper proposes an improved way of determining the centers, computing them with the maximum and minimum rules. To reduce the negative effect that "noise" attributes in website content have on K-means clustering, this paper also incorporates a space-model correction method into the clustering process: using a comprehensive feature-attribute measure, we judge the importance of the feature attributes and correct the feature space accordingly, achieving "noise reduction". Experiments show that, on website data, the improved algorithm has a clear advantage in stability over traditional K-means, and that the clustering quality is improved.
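The abstract does not spell out how the maximum and minimum rules select the centers. The sketch below shows one common reading, the max-min distance criterion: after a first center is fixed, each further center is the point whose distance to its nearest already-chosen center is largest. Taking the first data point as the first center, using Euclidean distance, and the names `euclidean` and `max_min_centers` are all assumptions made for illustration, not details taken from the thesis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, k, distance=euclidean):
    """Pick k initial centers with the max-min distance rule.

    Assumption: the first center is points[0]; each further center is the
    point whose distance to its nearest already-chosen center is largest.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [points[0]]
    # min_dist[i] = distance from points[i] to its nearest chosen center.
    min_dist = [distance(p, centers[0]) for p in points]
    while len(centers) < k:
        # The point farthest from all chosen centers becomes the next center.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [
            min(d, distance(p, points[idx]))
            for d, p in zip(min_dist, points)
        ]
    return centers


# Toy data (hypothetical): three well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 9.0)]
centers = max_min_centers(points, 3)
```

Because the selection is deterministic once the first center is fixed, repeated runs on the same data start from the same centers, which is the kind of stability that random initialization lacks. The `distance` parameter can be replaced by any dissimilarity function suited to sparse text vectors.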
Keywords/Search Tags: high-dimensional sparse, similarity measurement, weighting, K-means, feature selection, maximum and minimum rules