Partition-based Clustering Research And Its Application In Web Mining

Posted on: 2008-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: L Yan
Full Text: PDF
GTID: 2178360242467084
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of networks and the proliferation of information, Internet users find it difficult to locate useful information quickly and efficiently. Documents are the main form of information on the Internet, so a tool that can extract knowledge quickly and efficiently from Web documents is needed. Text clustering addresses this need. It divides large volumes of information into clusters and is widely applied in information retrieval, e-mail filtering, Web page classification, and many other fields. As a result, text clustering has become one of the most important research areas in data mining worldwide.

This thesis studies two problems in text clustering. The first is calculating term weights and reducing dimensionality during preprocessing. The second is selecting the initial points for partition-based K-Means.

First, the calculation of term weights is studied, noting that the structure of Web pages is important to clustering. Second, the feature vectors of documents are typically high-dimensional and sparse, which makes them unsuitable for clustering directly, so experiments are carried out on feature space reduction. The experiments show that clustering accuracy rises as the number of features increases, but declines slightly as the number of features continues to grow. Third, the selection of initial points for K-Means is analyzed. To improve on the traditional method, which chooses starting points at random, the max-min distance method is applied to the sample set to choose better initial cluster centers, and a text clustering method based on max-min distance is proposed.

Finally, the text clustering method based on max-min distance is applied to cluster penal web pages. A comparison with K-Means shows that the proposed algorithm improves both the accuracy and the stability of the clustering.
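
The abstract does not give the details of the initialization step, so the following is only a minimal sketch of one common form of max-min distance seeding, not the thesis's own implementation. It assumes a fixed number of clusters k, Euclidean distance, and an arbitrarily chosen first center; the function names `euclidean` and `max_min_init` are hypothetical. Each new center is the point whose distance to its nearest already-chosen center is largest.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_init(points, k, first=0):
    """Choose k initial centers by the max-min distance rule.

    Assumptions (not taken from the thesis): k is fixed in advance,
    distance is Euclidean, and points[first] is the first center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    chosen = [first]
    # Distance from every point to its nearest chosen center so far.
    min_dist = [euclidean(p, points[first]) for p in points]

    while len(chosen) < k:
        # The next center is the point farthest from all chosen centers.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        # Update nearest-center distances with the new center.
        for i, p in enumerate(points):
            d = euclidean(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d

    return [points[i] for i in chosen]


if __name__ == "__main__":
    sample = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
    print(max_min_init(sample, 3))
```

The returned centers could then be passed to a standard K-Means loop in place of randomly chosen starting points. For text data, a cosine-based distance over term-weight vectors may be a more natural choice than Euclidean distance; that substitution is left as an assumption here.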
Keywords/Search Tags: Text Clustering, K-Means, Cluster Centers