Design and Implementation of a Web Document Clustering System

Posted on: 2007-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Jiang
Full Text: PDF
GTID: 2178360182995832
Subject: Computer application technology
Abstract/Summary:
We live in an information society in which information of every kind is growing explosively. Data mining and knowledge discovery have emerged in response and have shown great vitality, helping people use information effectively. Cluster analysis is an important part of data mining research. Clustering is the process of grouping a set of physical or abstract objects into classes, or clusters.

This thesis first studies data representation, feature extraction, and weight calculation for web document clustering. It then develops a software system that downloads news from the Internet, extracts the news content, extracts original word forms, calculates term weights, clusters the documents, and visualizes the clustering results. The software uses multithreading and XML.

The K-means clustering algorithm is introduced and analyzed, and an improved Euclidean distance for web document clustering is proposed. Compared with the traditional Euclidean distance, the improved distance increases both the speed and the quality of document clustering, with results close to those of the cosine distance commonly used in text clustering.

A new intersection-based clustering combination algorithm, modeled on voting, is presented. Given several different clustering results for the same data set, the algorithm first extracts the correspondence between the clusters of the different results and then computes the intersection of the corresponding clusters. The remaining disputed objects are put to a vote, and any object still unassigned after voting is placed in the cluster with the nearest center. Alternatively, the disputed objects can be assigned directly to the cluster with the nearest center, without voting.

Several methods for visualizing clustering results are implemented: random-point charts, ordered-point charts, electron-cloud charts, bar charts, and pie charts. Each has its own advantages and disadvantages, so they are used to complement one another. In the ordered-point chart, each object has a fixed position, so the chart can show information about every object. It is well suited to displaying the dynamic clustering process and is used widely in this thesis.

Finally, experiments on a large number of web documents validate the improved Euclidean distance and the intersection-based clustering combination algorithm.
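
The intersection-based combination procedure is described only in outline here, so the following Python sketch is one possible reading of it, not the thesis's implementation. It makes several assumptions that the abstract does not state: cluster correspondence is found by brute-force label permutation against the first clustering (practical only for small k), a vote is "accepted" only with a strict majority, and cluster centers are computed from the agreed (intersection) objects. The names `align_labels`, `combine_clusterings`, and `use_voting` are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import dist


def align_labels(reference, labels, k):
    """Relabel `labels` so its clusters best match those of `reference`.

    Assumption: correspondence is found by trying all k! label
    permutations and keeping the one with the largest overlap.
    """
    best_perm, best_overlap = None, -1
    for perm in permutations(range(k)):
        overlap = sum(1 for r, l in zip(reference, labels) if r == perm[l])
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return [best_perm[l] for l in labels]


def combine_clusterings(points, clusterings, k, use_voting=True):
    """Combine several clusterings of the same points into one labeling."""
    # Step 1: put every clustering into the label space of the first one.
    reference = clusterings[0]
    aligned = [reference] + [align_labels(reference, c, k)
                             for c in clusterings[1:]]

    # Step 2: the intersection is the set of objects on which all
    # aligned clusterings agree; everything else is disputed.
    n = len(points)
    final = [None] * n
    disputed = []
    for i in range(n):
        if len({a[i] for a in aligned}) == 1:
            final[i] = aligned[0][i]
        else:
            disputed.append(i)

    # Assumption: centers are the means of the agreed objects per cluster.
    centers = {}
    for c in range(k):
        members = [points[i] for i in range(n) if final[i] == c]
        if members:
            centers[c] = tuple(sum(x) / len(members) for x in zip(*members))

    # Steps 3 and 4: vote on disputed objects; without a strict majority
    # (or with voting disabled) fall back to the nearest center.
    for i in disputed:
        top_label, top_count = Counter(a[i] for a in aligned).most_common(1)[0]
        majority = use_voting and top_count * 2 > len(aligned)
        if majority or not centers:
            final[i] = top_label
        else:
            final[i] = min(centers, key=lambda c: dist(points[i], centers[c]))
    return final


# Toy data: two tight groups and one point, (4, 4), that the inputs dispute.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (4, 4)]
run_a = [0, 0, 0, 1, 1, 1, 0]
run_b = [1, 1, 1, 0, 0, 0, 0]  # same groups as run_a, labels swapped
run_c = [0, 0, 0, 1, 1, 1, 1]
combined = combine_clusterings(points, [run_a, run_b, run_c], k=2)
```

In this toy input, two of the three aligned clusterings place the disputed point in cluster 1, so voting assigns it there even though the center of cluster 0 is closer. Calling the function with `use_voting=False` corresponds to the variant that skips the vote and uses the nearest center directly.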
Keywords/Search Tags:Data Mining, Clustering Analysis, Document Mining, Preprocessing, Clustering Combination, Visualization, Euclidean Distance