Research On Web Document Clustering Based On Sentential Maximum Frequent Word Sets

Posted on:2008-08-24

Degree:Master

Type:Thesis

Country:China

Candidate:L Yuan

Full Text:PDF

GTID:2178360272967745

Subject:Computer application technology

Abstract/Summary:

Web document clustering can help search engines find high-quality web pages, and it is an important research direction in web mining. One of the keys to web document clustering is the choice of feature items. The theme of a document is not related to every word in it, so the key is to find the feature items that best reflect the document's themes. The frequent patterns obtained by existing mining algorithms are high-dimensional and do not express semantic information well. How to mine suitable feature items has therefore become an important aspect of clustering algorithms.

To address this problem, and drawing on current work in data mining, this thesis presents a document database model. Each document is mapped to a database, each sentence is regarded as a transaction, and each word is regarded as an item. An association rule mining algorithm is then used to mine the feature word sets that best reflect the document. Compared with traditional frequent feature items, sentence-based frequent word sets carry more local information.

Because web documents are very numerous, the thesis presents a two-stage clustering model consisting of initial clustering and precise clustering. After the initial clustering, classes are merged or split according to the distance between two classes and a link-intensity threshold within a class, which yields the final document clustering. In this process, the contribution of each frequent word set to the clustering is computed using the variable precision rough set, and a weight is computed for each frequent word set.

The thesis also presents an expansion of the cluster description based on the tolerance rough set. To strengthen the clustering result, every class needs to be described once the clusters have been obtained. Because language contains phenomena such as synonyms and shortened forms, the words used to describe each class need to be expanded. The tolerance rough set model has a significant advantage in handling fuzzy and uncertain relations. In information retrieval, especially in query term expansion, it has been widely applied to handling the relations between documents and the relations between features.
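
The document database model described in the abstract (one document as a database, each sentence as a transaction, each word as an item) can be illustrated with a minimal Python sketch. The abstract does not say which association rule mining algorithm the thesis uses, so a simple Apriori-style level-wise search stands in here. The function names, the regular-expression tokenizer, and the `min_support` threshold are all assumptions made for illustration. Stop-word removal and stemming are omitted.

```python
import re
from itertools import combinations


def sentences_to_transactions(text):
    """Treat each sentence as a transaction: a set of lowercase words.

    Assumption: sentences end at '.', '!' or '?', and words are runs of a-z.
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:
            transactions.append(words)
    return transactions


def frequent_word_sets(transactions, min_support):
    """Apriori-style search for word sets found in >= min_support sentences."""
    def support(word_set):
        return sum(1 for t in transactions if word_set <= t)

    vocabulary = {w for t in transactions for w in t}
    current = {frozenset([w]) for w in vocabulary
               if support(frozenset([w])) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current
        k += 1
    return frequent


def maximal_word_sets(frequent):
    """Keep only frequent sets that have no frequent proper superset."""
    return {s for s in frequent if not any(s < other for other in frequent)}


# Hypothetical three-sentence document, used only to exercise the sketch.
doc = ("Web mining finds patterns. Web mining clusters pages. "
       "Clusters group pages.")
transactions = sentences_to_transactions(doc)
result = maximal_word_sets(frequent_word_sets(transactions, min_support=2))
```

With a support threshold of two sentences, this toy document yields two maximal frequent word sets, {web, mining} and {clusters, pages}. Each set groups words that co-occur within sentences, which is the kind of local information the abstract attributes to sentence-based frequent word sets.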

Keywords/Search Tags:

Web Document Clustering, Rule Set, Association Rules, Maximum Frequent Word Set

Related items

1	Research On Maximum Frequent Itemsets Based On Improved FP-tree
2	Research On Medical Image Classification Based On Bag-of-Words Model And Association Rules
3	Research And Application Of An Improved Algorithm For Association Rules
4	The Association Rule Mining Algorithm Design And Implementation,
5	Based On The Maximum Frequent Set Data Mining Association Rules Algorithm
6	Fp-tree-based Association Rule Mining Algorithm Design And Implementation
7	Research On Algorithm Of Mining Association Rules Based On FP Tree
8	Research On Algorithm Of Mining Association Rules Based On Fp Tree
9	Studies On Algorithms Of Association Rule Mining In Data Mining
10	The Research And Implementation Of Quantitative Association Rules