Text Clustering Method Based On Frequent Itemsets

Posted on: 2010-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: J Xiao
Full Text: PDF
GTID: 2208360278970221
Subject: Computer application technology
Abstract/Summary:
Text is an important carrier of information, and its volume keeps growing with the development of the Internet. As an unsupervised machine learning method, text clustering is an important means of organizing, summarizing, and navigating text collections, and it is attracting a growing number of researchers. It plays an important role in many text mining and information retrieval systems.

This thesis focuses on how to improve the steps of Chinese text clustering so as to obtain better clustering results. The main steps are text pre-processing, feature selection, text representation, and clustering, all of which strongly affect clustering quality. Traditional clustering algorithms are based on the vector space model (VSM). VSM is a keyword-based model that ignores the latent semantic relations between words. In addition, its inherent "curse of dimensionality" has become a bottleneck for improving algorithm performance. Both problems seriously degrade the efficiency of text clustering algorithms. This thesis introduces HowNet as the ontology for clustering. By mapping the keywords of each text to the corresponding concepts in HowNet, the algorithms can operate on sets of concepts, which compensates for the semantic information that VSM lacks. To improve performance, we introduce the concepts of frequent item-sets and non-overlap and adopt a new partitioning rule to cluster the original texts. Based on these ideas, a clustering algorithm based on frequent item-sets, named CFI, is proposed.

In the final part of the thesis, several experiments are designed to analyze the feasibility of CFI. The experimental results show that, by integrating HowNet with the idea of frequent item-sets, the proposed algorithm effectively reduces the dimensionality of text features, improves clustering accuracy, and achieves better quality than traditional frequent item-set based methods.
Keywords/Search Tags: Text clustering, Concept mapping, Frequent Item-set
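
The pipeline outlined in the abstract (concept mapping, frequent item-set mining, non-overlapping assignment) can be illustrated with a minimal Python sketch. It is not the CFI algorithm itself, whose partitioning rule the abstract does not spell out. Everything here is an assumption made for illustration: `CONCEPT_MAP` is a tiny hand-written table standing in for a HowNet lookup, the item-sets are found by brute-force enumeration rather than an efficient miner, and "non-overlapping" is read as "each document ends up in exactly one cluster", implemented by assigning a document to the largest frequent concept set that covers it.

```python
from itertools import combinations

# Hypothetical keyword -> concept table; a stand-in for a real HowNet lookup.
CONCEPT_MAP = {
    "car": "vehicle", "truck": "vehicle", "bus": "vehicle",
    "road": "route", "highway": "route",
    "doctor": "medical", "nurse": "medical", "hospital": "medical",
    "drug": "medicine", "pill": "medicine",
}


def to_concepts(tokens):
    """Map keywords to concepts; keywords without an entry are kept as-is."""
    return frozenset(CONCEPT_MAP.get(t, t) for t in tokens)


def frequent_itemsets(docs, min_support, max_size=2):
    """Return {concept set: ids of covering documents} for every concept set
    of up to max_size items contained in at least min_support documents.
    Brute-force enumeration, fine for a toy corpus only."""
    items = sorted(set().union(*docs))
    found = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            cover = frozenset(
                i for i, doc in enumerate(docs) if doc.issuperset(combo)
            )
            if len(cover) >= min_support:
                found[frozenset(combo)] = cover
    return found


def cluster(docs, min_support=2, max_size=2):
    """Assumed partitioning rule (not taken from the thesis): rank item-sets
    by size, then by cover, and give each document to the first item-set
    that covers it, so clusters never overlap. Documents covered by no
    frequent item-set are left unassigned."""
    itemsets = frequent_itemsets(docs, min_support, max_size)
    ranked = sorted(
        itemsets.items(),
        key=lambda kv: (-len(kv[0]), -len(kv[1]), sorted(kv[0])),
    )
    assigned = {}
    for label, cover in ranked:
        for doc_id in sorted(cover):
            assigned.setdefault(doc_id, label)
    clusters = {}
    for doc_id, label in assigned.items():
        clusters.setdefault(label, []).append(doc_id)
    return clusters


if __name__ == "__main__":
    raw = [
        ["car", "road"],
        ["truck", "highway"],
        ["doctor", "drug"],
        ["nurse", "pill", "hospital"],
        ["bus", "doctor"],
    ]
    docs = [to_concepts(tokens) for tokens in raw]
    for label, members in cluster(docs).items():
        print(sorted(label), members)
```

In this toy corpus the ten distinct keywords collapse to four concepts, which mirrors the dimensionality reduction the abstract attributes to concept mapping. Documents 0 and 1 share no keyword, yet they fall into the same cluster because both map to the concept set {route, vehicle}.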