
Study And Implementation Of Frequent Closed Word Sequence Set Based Hierarchical Clustering Algorithm

Posted on: 2011-03-08
Degree: Master
Type: Thesis
Country: China
Candidate: H C Geng
GTID: 2248330395458007
Subject: Computer application technology
Abstract/Summary:
With the development of the World Wide Web, computers all over the world have been connected into one large database. Owning a computer that is a node of this network amounts to having access to all the information resources of the Web. A lack of information is no longer the main problem; instead, it has become difficult to find exactly what we need in such a large database. Data mining technology emerged in response to this need. Because text is the most commonly used data format, text mining is becoming increasingly popular within the data mining field. Text clustering is a technique that groups texts on similar topics together and separates texts on different topics.

Against this background, we analyse the drawbacks of existing text clustering algorithms and study two aspects of text clustering: the text preprocessing stage and the clustering algorithm itself. The major work and contributions of this thesis are as follows.

We systematically introduce the processes related to text clustering and propose using Frequent Closed Word Sequences (FCWS) as the features of the text vector space model. The purpose is to reduce the dimensionality of the text representation, increase the granularity of the features, and account for the effect of word order and word consecutiveness on expressing the topic of a text. We then design a pattern-based method to find all the FCWS in a text. Finally, we use the Frequent Closed Word Sequence Set (FCWSS) extracted from the database to build an algorithm named the Frequent Closed Word Sequence Set Based Hierarchical Clustering Algorithm (FCWSS-Based AHC). Using the FCWSS as the similarity measure further reduces the dimensionality of the text, and the algorithm takes the number of clusters as an optional input parameter. FCWSS-Based AHC not only achieves high clustering accuracy but also generates a label for every cluster, which makes the clusters more intelligible. The algorithm can be used to build user interest models in recommendation systems.
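
To make the notion of a frequent closed word sequence concrete, the following Python sketch may help. It is not the pattern-based method of the thesis, which the abstract does not describe. It is a brute-force illustration built on several assumptions: a word sequence is taken to be a run of consecutive words, its support is the number of documents containing it, and it is "closed" if no longer frequent sequence containing it has the same support. The function names and the parameters `min_support` and `max_len` are hypothetical.

```python
from collections import defaultdict


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def frequent_closed_word_sequences(docs, min_support=2, max_len=4):
    """Map each frequent closed word sequence (a tuple of words) to its support.

    Assumption: closedness is checked only against sequences of at most
    `max_len` words, so this is an approximation for longer patterns.
    """
    # Record which documents contain each contiguous word sequence.
    doc_ids = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                doc_ids[tuple(words[i:i + n])].add(doc_id)

    # Keep sequences that appear in at least `min_support` documents.
    frequent = {seq: ids for seq, ids in doc_ids.items() if len(ids) >= min_support}

    # Drop sequences that a longer frequent sequence covers with equal support.
    closed = {}
    for seq, ids in frequent.items():
        absorbed = any(
            len(other) > len(seq)
            and len(other_ids) == len(ids)
            and contains(other, seq)
            for other, other_ids in frequent.items()
        )
        if not absorbed:
            closed[seq] = len(ids)
    return closed


docs = ["data mining is fun", "text data mining works", "data mining rocks"]
print(frequent_closed_word_sequences(docs))
# prints {('data', 'mining'): 3}
```

In this toy example the single words "data" and "mining" are frequent but not closed, because the longer sequence "data mining" occurs in the same three documents. Keeping only the closed sequence is what yields fewer, coarser-grained features, and such a sequence could plausibly also serve as a cluster label.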
Keywords/Search Tags: frequent closed word sequence, cluster label, hierarchical clustering