
Research On Text Clustering Methods And Their Applications

Posted on: 2009-09-23
Degree: Master
Type: Thesis
Country: China
Candidate: B Y Li
Full Text: PDF
GTID: 2178360272989860
Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet, the volume of data on the web is growing explosively, and most of it is textual. Because text is non-numerical and semantically complex, text mining has become both a popular and a difficult area of data mining. Limited by the availability of training samples, text categorization does not work well in some text-related applications, such as spam detection. Text clustering, which groups objects automatically without training samples, has therefore become an important alternative for these applications.

A text clustering method consists mainly of a text representation model and a text clustering algorithm. So far, most text representation models are term-based, so the data they generate is high-dimensional and sparse. In a high-dimensional space, clusters exist only in subspaces, and different clusters lie in different subspaces. Because of the "curse of dimensionality", traditional clustering algorithms cannot process such high-dimensional data directly; the dimensionality must be reduced first.

Taking the Vector Space Model (VSM) as the basis, we start from text clustering algorithms and then build corresponding text clustering methods on top of them. By studying the process of the traditional text clustering method, we analyze the requirements it places on clustering algorithms and propose a robust clustering algorithm, from which a corresponding text clustering method is formed. By analyzing the weakness of the traditional text clustering method in dimension reduction, we propose a novel subspace clustering algorithm, on which a corresponding text subspace clustering method is built. The main contributions of this thesis are as follows:

1. We study CURE and combine its idea of shrinking data points with grid density to obtain a fine-grained measure of local density, and on this basis propose a grid-based clustering algorithm that uses data point shrinking. Experimental results show its effectiveness.
2. Based on the grid-based algorithm, a corresponding text clustering method is built and applied to spam detection and Chinese text clustering.
3. To address the instability of clustering results and their dependence on initialization, a novel soft subspace clustering algorithm for text documents and an accompanying initialization algorithm are proposed.
4. Based on the above, a corresponding text subspace clustering method is built and applied to spam detection and Chinese text clustering; in experiments it reduces dimensionality more effectively than the traditional text clustering method.
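
The claim that term-based representations such as the VSM yield high-dimensional, sparse data can be illustrated with a minimal sketch. This is not the thesis's code: the whitespace tokenization, the TF-IDF weighting, the toy documents, and the helper names `tfidf_vectors` and `cosine` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors ({term: weight}) for tokenized documents.

    Hypothetical helper: the thesis only states that it uses the VSM;
    the exact weighting scheme here is an assumption.
    """
    n_docs = len(docs)
    # Document frequency: number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Terms present in every document get weight log(1) = 0, so skip them.
        vectors.append(
            {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df[t] < n_docs}
        )
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented): two spam-like messages and one ordinary message.
docs = [
    "cheap pills buy now".split(),
    "buy cheap watches now".split(),
    "meeting agenda for project review".split(),
]
vocab, vecs = tfidf_vectors(docs)

# Fraction of zero entries if the vectors were stored densely.
sparsity = 1 - sum(len(v) for v in vecs) / (len(vecs) * len(vocab))
print(len(vocab), round(sparsity, 3))
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Even with three short documents the vocabulary has one dimension per distinct term and more than half of the entries are zero; on a real corpus the vocabulary grows to tens of thousands of terms, which is the setting that motivates dimension reduction or subspace clustering.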
Keywords/Search Tags: Data Mining, Text Mining, Text Clustering, Spam Email