
Research On Web Document Clustering Based On Sentential Maximum Frequent Word Sets

Posted on: 2008-08-24
Degree: Master
Type: Thesis
Country: China
Candidate: L Yuan
Full Text: PDF
GTID: 2178360272967745
Subject: Computer application technology
Abstract/Summary:
Web document clustering can help search engines find high-quality web pages, and it is an important research direction in web mining. One of the keys to web document clustering is the choice of feature items. The theme of a document is not related to every word in it; the key is to find the feature items that best reflect the document's themes. The frequent patterns obtained by existing mining algorithms are high-dimensional and do not reflect semantic information well. How to mine ideal feature items is therefore becoming an important aspect of clustering algorithms.

To address this problem, and drawing on current work in the data mining field, this thesis presents a document database model. Each document is mapped to a database, each sentence is regarded as a transaction, and each word is regarded as an item. The feature word sets that best reflect the document are then mined with an association rule mining algorithm. Compared with traditional frequent feature items, sentence-based frequent word sets contain more local information.

Because Web documents are very numerous, the thesis presents a two-stage clustering model consisting of initial clustering and precise clustering. After the initial clustering, classes are merged or split based on the distance between two classes and a link-intensity threshold within a class, which yields the final document clustering. In this process, the contribution of each frequent word set to the clustering is computed using a variable precision rough set, and a weight is computed for each frequent word set.

The thesis also presents an expansion of cluster descriptions based on the tolerance rough set. To improve the usefulness of the clustering, each class needs to be described once the clustering result has been obtained. Because natural language contains phenomena such as synonyms and abbreviated forms, the words used to describe each class need to be expanded. The tolerance rough set model has a significant advantage in processing fuzzy and uncertain relations. In information retrieval, especially query term expansion, it has been widely applied to handling the relations between documents and the relations between features.
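
The document database model described above (sentences as transactions, words as items) can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes a naive regex sentence splitter, an absolute support threshold counted in sentences, and a plain level-wise (Apriori-style) search followed by a filter that keeps only the maximal sets. The function names `sentences_to_transactions`, `frequent_word_sets`, and `maximal_sets`, and the sample text, are hypothetical and chosen for illustration only.

```python
import re
from itertools import combinations


def sentences_to_transactions(text):
    """Map a document to a 'database': one set of lowercase words per sentence."""
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words:
            transactions.append(words)
    return transactions


def frequent_word_sets(transactions, min_support):
    """Level-wise search for word sets that occur in at least
    `min_support` sentences (an absolute count, assumed here)."""

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    vocabulary = {w for t in transactions for w in t}
    current = {frozenset([w]) for w in vocabulary
               if support(frozenset([w])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-sets that have exactly k words.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


def maximal_sets(frequent):
    """Keep only frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical sample document with four sentences.
doc = ("Rough sets handle uncertainty. Rough sets support clustering. "
       "Clustering groups web documents. Web documents need clustering.")
result = maximal_sets(frequent_word_sets(sentences_to_transactions(doc), 2))
for word_set in result:
    print(sorted(word_set))
```

Counting support per sentence rather than per document is what lets the mined word sets carry local co-occurrence information. A real pipeline would also need stop-word removal and stemming, which this sketch omits.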
Keywords/Search Tags:Web Document cluster, Rule Set, Association Rules, Maximum Frequent Words Set