
Study On Several Issues Of Text Clustering

Posted on: 2008-11-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M T Gao
GTID: 1118360245990942
Subject: Management Science and Engineering
Abstract/Summary:
Given the massive volume and high dimensionality of text data, building effective and scalable text clustering algorithms is one of the research directions of data mining. To address these issues, this dissertation studies several basic problems of text clustering, as follows.

A new text clustering algorithm based on projection pursuit is proposed. It uses a genetic algorithm to search for the optimal projection direction and projects high-dimensional text feature vectors into a low-dimensional space. The structural features of the texts can then be shown intuitively, and the text clustering results can be visualized.

To address the problems of high dimensionality and a predetermined number of clusters, several RPCL text clustering algorithms based on LSA, CI, RP, and NMF are also proposed. They reduce dimensionality with LSA, CI, RP, or NMF and cluster the texts with RPCL. They not only reduce dimensionality effectively but also remove the need to fix the number of clusters in advance.

Based on the Vector Space Model, a new text feature selection model based on double-word relations is proposed in this dissertation. The model adds the double-word relation information of texts to the Vector Space Model so that it contains richer and more exact text feature information. Combined with Latent Semantic Analysis, it not only reduces dimensionality effectively but also suppresses some noise and highlights the semantic features of the text. It can therefore improve the quality of text mining greatly.

Based on the Document Index Graph feature representation model, a new method for calculating text similarity is proposed. The similarity is adjusted with a suitable transformation function so that texts become more distinguishable, which benefits text clustering analysis and classification.

Suffix Tree Clustering is applied to Chinese text clustering, in which a text is regarded as a set of phrases and the similarity of texts is represented by a suffix tree. This can handle the clustering of multi-topic texts, remove the need for a predefined number of clusters, and achieve soft text clustering.
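
As a rough illustration of the double-word idea mentioned in the abstract, the sketch below extends plain word counts with counts of adjacent word pairs and compares two texts by cosine similarity. The abstract does not define "double-word relation", so treating it as an ordered pair of adjacent words is an assumption made here, and the names vsm_features and cosine are hypothetical. This is not the dissertation's model, only a minimal example of how pair information can separate texts that a bag-of-words representation cannot.

```python
import math
from collections import Counter


def vsm_features(tokens, with_pairs=True):
    """Term counts, optionally extended with adjacent word-pair counts."""
    feats = Counter(tokens)  # plain Vector Space Model: one dimension per word
    if with_pairs:
        # Assumption: a "double-word relation" is an ordered pair of adjacent words.
        feats.update(zip(tokens, tokens[1:]))
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


# Two toy texts with identical words in a different order.
d1 = "data mining text".split()
d2 = "text mining data".split()

# Words only: the two texts are indistinguishable.
print(round(cosine(vsm_features(d1, False), vsm_features(d2, False)), 3))  # 1.0
# Words plus adjacent pairs: the different word order lowers the similarity.
print(round(cosine(vsm_features(d1), vsm_features(d2)), 3))  # 0.6
```

The Latent Semantic Analysis step that the abstract combines with this model is not shown; in this sketch the pair features simply add dimensions to the word-count vectors.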
Keywords: Text Mining, Text Clustering, Feature Denotation, Feature Dimension Reduction, Competitive Learning