
Web Document Clustering Based On Knowledge Granularity

Posted on: 2006-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: F L Huang
Full Text: PDF
GTID: 2168360155471497
Subject: Computer software and theory
Abstract/Summary:
With its rapid development, the Internet (or WWW) has profoundly changed the way people live, and the WWW has become a main information channel for communication and information acquisition. A great deal of valuable knowledge lies latent in the large, distributed data repository that is the Internet. All users, whether individuals or enterprises, face a challenging issue: how to acquire potentially useful knowledge from the Internet efficiently and effectively. Web data mining, which derives from data mining, has become a hot and important topic in data analysis and has attracted many experts and researchers. Over the last ten years, Web data mining has been widely studied and has made great progress. Many Web mining technologies have matured and have been successfully applied in real-world applications. For example, search engines improve information acquisition from the Internet, and e-business provides a novel business model that benefits enterprises. Unlike traditional data, Web data is characterized by complicated structure, varied forms, and rich content, and users' requirements can be diverse. This makes Web data mining much more challenging. The most common and important form of Web data is the Web page, represented in a markup language. Existing Web data mining can be roughly classified into three categories: content mining, usage mining, and structure mining. The dominant technologies used in Web data mining are association analysis, time-sequence analysis, and clustering analysis. Clustering analysis helps reduce the search space and shorten information retrieval time. It helps to efficiently discover documents that are likely to be similar to a given one. It is also useful for improving the recall and precision of IR systems and for personalizing search engines effectively. Web clustering is therefore a key task in Web mining.
In this thesis, building on a deeper understanding of existing data mining and Web document clustering methods, we first analyze the traditional text representation models and text clustering algorithms, together with their limitations. We then adopt knowledge granularity to build a theory and algorithm for Web document clustering. The main contributions are as follows. (1) Traditional Web clustering algorithms are based on two levels of knowledge granularity, i.e. document and term. This can lead to clustering results that are "falsely relevant". This thesis proposes a new method for Web document representation with multiple levels of knowledge granularity, realized as a concrete model: the "Document-Paragraph-Term" (abbreviated "D-P-T") model. (2) As is well known, traditional VSM similarity measures can produce many "zero similarity" values. To solve this problem, we use tolerance rough set theory to design an Extended VSM (abbreviated EVSM) similarity. (3) We use K-Means as our clustering algorithm. Although K-Means is well suited to large document collections, it is sensitive to outliers (and can produce poor output when clustering non-spherical data). To address this, we improve upon K-Means and name the result the NK-means algorithm. (4) Finally, we develop a platform for Web data analysis, named WebAnalyser. Its core is a Web clustering algorithm, WCBKG. Experimental evaluation demonstrates that our WCBKG approach, compared with traditional Web document clustering algorithms, achieves both higher classification accuracy and better understandability.
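The "zero similarity" problem in contribution (2) can be illustrated with a small sketch. The abstract does not give the EVSM definition, so the code below is not the thesis's method. It assumes one common tolerance-rough-set construction: two terms are treated as tolerant when they co-occur in at least `theta` documents, and a document's vector is extended with terms from its upper approximation. The names `theta`, `borrowed_weight`, `tolerance_classes`, and `extended_vector`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in >= theta documents.

    Assumption: co-occurrence counting is one common way to define the
    tolerance relation; the thesis may define it differently.
    """
    cooccurrence = Counter()
    vocabulary = set()
    for doc in docs:
        terms = set(doc)
        vocabulary |= terms
        for a, b in combinations(sorted(terms), 2):
            cooccurrence[(a, b)] += 1
    classes = {t: {t} for t in vocabulary}  # every term is tolerant with itself
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def extended_vector(doc, classes, borrowed_weight=0.5):
    """Term-frequency vector plus a small weight for upper-approximation terms.

    A term belongs to the upper approximation of the document when its
    tolerance class shares at least one term with the document.
    borrowed_weight is a hypothetical constant, not a value from the thesis.
    """
    vector = {t: float(n) for t, n in Counter(doc).items()}
    terms = set(doc)
    for term, cls in classes.items():
        if cls & terms:
            vector.setdefault(term, borrowed_weight)
    return vector


# Toy corpus used only to learn which terms tend to appear together.
corpus = [
    ["web", "mining", "clustering"],
    ["web", "mining", "rough"],
    ["clustering", "kmeans"],
    ["rough", "set"],
    ["web", "mining"],
]
classes = tolerance_classes(corpus, theta=2)

doc_a, doc_b = ["web"], ["mining"]  # no shared term
plain = cosine(dict(Counter(doc_a)), dict(Counter(doc_b)))
extended = cosine(extended_vector(doc_a, classes), extended_vector(doc_b, classes))
print(plain, extended)
```

In this toy corpus, "web" and "mining" co-occur often enough to be tolerant, so two documents with no term in common receive a non-zero extended similarity, whereas the plain vector space model scores them at zero.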
Keywords/Search Tags: Data Mining, Web Mining, Web Document Clustering, Rough Set, Knowledge Granularity