
Research Of Web Text Clustering Technology And Clustering Result Visualization

Posted on: 2009-08-18
Degree: Master
Type: Thesis
Country: China
Candidate: L H Ma
GTID: 2178360272463225
Subject: Computer applications
Abstract/Summary:
With the spread of the Internet and advances in computer technology, the Web has become a huge, dynamic, heterogeneous information repository. As Web data grows rapidly, people need to obtain knowledge from the Web quickly and effectively. Web text clustering is a key task in Web data mining. Cluster analysis helps reduce the search space and shorten information retrieval time. It helps find documents similar to a given one efficiently, and it can improve the recall and precision of IR systems and support personalized search engines. This thesis studies Web text clustering and clustering result visualization. First, the thesis introduces the concept of clustering, the classification of Web data mining, commonly used text clustering methods, and the relevant data preprocessing, cluster evaluation, and clustering feature visualization techniques. Second, it gives a comprehensive account of the Web text clustering process and the key technologies involved, and analyzes the current state and open problems of these technologies. Third, building on an analysis of the basic K-means algorithm, it introduces an improved term weighting function for Web text clustering. The weight adjustment function uses HTML tag information and the positional semantics of Web text, and adds an information gain weighting factor to improve the ability of term features to discriminate between categories. Building on a survey and analysis of existing data visualization techniques, the thesis improves the traditional parallel coordinates and 2D scatter plot visualization methods to make them more intuitive and understandable.
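The abstract does not give the weighting formula, so the following is only a minimal sketch of how a tag-aware term weight with an information gain factor might look. The tag weights in `TAG_WEIGHTS`, the combination rule `weighted_tf * (1 + IG)`, and all function names are assumptions made for illustration, not the thesis's actual function. Information gain is computed here from class labels, which assumes a labeled sample is available.

```python
import math
from collections import Counter

# Hypothetical tag weights: assumed values, not taken from the thesis.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "body": 1.0}


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term, docs, labels):
    """Information gain of a term's presence/absence with respect to labels.

    docs is a list of term sets; labels is the parallel list of class labels.
    """
    base = entropy(list(Counter(labels).values()))
    with_term = [lab for d, lab in zip(docs, labels) if term in d]
    without_term = [lab for d, lab in zip(docs, labels) if term not in d]
    conditional = 0.0
    for part in (with_term, without_term):
        if part:
            conditional += len(part) / len(docs) * entropy(list(Counter(part).values()))
    return base - conditional


def weighted_tf(tagged_terms):
    """Term frequency where each occurrence counts by the weight of its HTML tag.

    tagged_terms is a list of (term, tag) pairs; unknown tags count as 1.0.
    """
    tf = Counter()
    for term, tag in tagged_terms:
        tf[term] += TAG_WEIGHTS.get(tag, 1.0)
    return tf


def term_weight(term, tagged_terms, ig):
    """Assumed combination rule: tag-weighted frequency scaled by (1 + IG)."""
    return weighted_tf(tagged_terms)[term] * (1.0 + ig)
```

Under these assumptions, a term that appears in a title and also separates the classes well receives a larger weight than the same term appearing only in body text.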
The thesis also implements a dynamic visual clustering method based on parallel coordinates. Finally, based on the above analysis and research, the thesis designs and implements a parallel K-means clustering algorithm and a Web text clustering system that clusters Web text using K-means and complete-linkage hierarchical clustering. Experiments on several text sets show that, compared with the traditional K-means algorithm, the improved parallel K-means algorithm achieves better classification accuracy and understandability. The experiments also show that the concurrent K-means algorithm produces the same results as the serial algorithm while running much more efficiently.
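The claim that a concurrent K-means gives the same result as the serial one can be illustrated with a small sketch. The abstract does not describe how the thesis parallelizes the algorithm, so the design below is an assumption: only the assignment step is split into chunks handled by a thread pool, and the function names and the `workers` parameter are hypothetical. In CPython, threads will not speed up this pure-Python arithmetic; the sketch shows the structure and the equivalence of results, not the performance gain. It assumes a non-empty list of points.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_chunk(chunk, centroids):
    """Return the index of the nearest centroid for each point in the chunk."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in chunk]


def kmeans(points, centroids, workers=1, max_iter=100):
    """K-means with the assignment step split across `workers` threads."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Split the points into contiguous chunks, at most one per worker.
        size = -(-len(points) // workers)  # ceiling division
        chunks = [points[i:i + size] for i in range(0, len(points), size)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            parts = list(pool.map(assign_chunk, chunks, [centroids] * len(chunks)))
        # Chunks are contiguous and map preserves order, so labels align with points.
        labels = [lab for part in parts for lab in part]
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Because each point's nearest centroid depends only on that point and the current centroids, splitting the assignment step across workers does not change the labels, which is why the serial and chunked runs agree when given the same initial centroids.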
Keywords/Search Tags:Web Text Clustering, K-means Algorithm, Weight Adjustment, Clustering Result Visualization