Web Document Clustering Based On Knowledge Granularity

Posted on:2006-04-07

Degree:Master

Type:Thesis

Country:China

Candidate:F L Huang

Full Text:PDF

GTID:2168360155471497

Subject:Computer software and theory

Abstract/Summary:

With its rapid advance, the Internet (or WWW) has enormously changed the way people live, and the WWW has become a main information channel for communication and information acquisition. A multitude of valuable knowledge lies latent in the large, distributed data repository of the Internet. All users, whether individuals or enterprises, face the same challenge: how to acquire potentially useful knowledge from the Internet efficiently and effectively. Web data mining, which derives from data mining, has become a hot and important topic in data analysis and has attracted many experts and researchers. Over the last ten years, Web data mining has been widely studied and has made great progress. Many Web mining technologies have matured and have been successfully applied in real-world applications. For example, search engines ease information acquisition from the Internet, and e-business provides a novel business mode that benefits enterprises. Unlike traditional data, Web data is characterized by complicated structure, varied forms, and rich content, and users' requirements can be diverse. This makes Web data mining much more challenging. The most common and important form of Web data is the Web page represented in a markup language. Existing Web data mining can be roughly classified into three categories: content mining, usage mining, and structure mining. The dominant techniques used in Web data mining are association analysis, time sequence analysis, and clustering analysis. Clustering analysis helps reduce the search space and shorten information retrieval time. It helps to discover efficiently the documents likely to be similar to a given one. It is also useful for improving the recall and precision of IR systems and for personalizing search engines effectively. Web clustering is therefore a key task in Web mining.
In this thesis, building on a deeper understanding of existing data mining and Web document clustering methods, we first analyze the traditional text representation models and text clustering algorithms, together with their limitations. We then adopt knowledge granularity to build a theory and an algorithm for Web document clustering. The main contributions are as follows. (1) Traditional Web clustering algorithms are based on two-level knowledge granularity, i.e. document and term. This can make clustering results "falsely relevant". This thesis proposes a new Web document representation with multi-level knowledge granularity, realized as a concrete model: the "Document-Paragraph-Term" (abbreviated "D-P-T") model. (2) As is well known, traditional VSM similarity measures can produce many "zero similarity" values. To solve this problem, we use tolerance rough set theory to design an Extended VSM (abbreviated EVSM) similarity. (3) We use K-Means as our clustering algorithm. Although K-Means is well suited to large document collections, it is sensitive to outliers (which can produce poor results when clustering non-spherical data). To address this, we propose an improved K-Means algorithm, named NK-means. (4) Finally, we develop a platform for Web data analysis, named WebAnalyser. Its core is the Web clustering algorithm WCBKG. Experimental evaluation demonstrates that our approach WCBKG, compared with traditional Web document clustering algorithms, achieves both higher classification accuracy and better understandability.
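The "zero similarity" problem in contribution (2) can be illustrated with a small sketch. The abstract does not give the definition of EVSM, so the code below is not the thesis's method. It assumes a generic tolerance rough set construction: two terms are treated as tolerant when they co-occur in at least `theta` documents, and each document vector is extended with the terms of its upper approximation. The function names, the toy corpus, `theta`, and `added_weight` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tolerance_classes(docs, theta):
    """Map each term to itself plus every term that co-occurs
    with it in at least `theta` documents (assumed tolerance relation)."""
    cooc = Counter()
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            cooc[(u, v)] += 1
    classes = {t: {t} for doc in docs for t in doc}
    for (u, v), n in cooc.items():
        if n >= theta:
            classes[u].add(v)
            classes[v].add(u)
    return classes


def upper_approximation(doc, classes):
    """All terms whose tolerance class intersects the document's terms."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


def raw_vector(doc):
    """Plain VSM vector using raw term frequency."""
    return {t: float(c) for t, c in Counter(doc).items()}


def extended_vector(doc, classes, added_weight=0.5):
    """Raw vector plus a small weight (an assumption) for each term
    that is in the upper approximation but not in the document."""
    vec = raw_vector(doc)
    for t in upper_approximation(doc, classes) - set(doc):
        vec[t] = added_weight
    return vec


# Hypothetical toy corpus: each document is a list of terms.
docs = [
    ["clustering", "kmeans", "centroid"],
    ["clustering", "kmeans", "outlier"],
    ["clustering", "rough", "set"],
    ["kmeans", "centroid"],
    ["clustering", "granularity"],
]

classes = tolerance_classes(docs, theta=2)

# docs[3] and docs[4] share no term, so plain VSM similarity is zero.
raw = cosine(raw_vector(docs[3]), raw_vector(docs[4]))
ext = cosine(extended_vector(docs[3], classes),
             extended_vector(docs[4], classes))

print("plain VSM similarity:   ", raw)
print("extended VSM similarity:", round(ext, 4))
```

In this toy corpus the last two documents have no term in common, so their plain cosine similarity is zero, while the extended vectors overlap on "kmeans" and "clustering" and yield a positive similarity. How the weights of added terms should be chosen is left open here; the constant `added_weight` is only a placeholder.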

Keywords/Search Tags:

Data Mining, Web Mining, Web Document Clustering, Rough Set, Knowledge Granularity

Related items

1	The Research Of Clustering Based On Rough Set Theory
2	The Research Of Granularity Computing Based On Rough Set In Data Mining
3	Research On Method Of Data Mining Based On Granular Computing Model
4	Studies On Granularity Data Mining And Its Application In Process Industry
5	Design And Implement Of Web Document Clustering System
6	Study On Methods Of Data Mining And Text Mining Based On Rough Set
7	Research Of Clustering Analysis And Its Application In Document Mining
8	Research And Application Of Data Mining Based On Rough Set In Clustering Discrimination
9	Research On Dynamic Data Mining Methods And Techniques Based On Rough Set Theory
10	An Algorithm Research On Data Mining Based On Rough Set