Semantic Correlation Based Text Clustering Approaches

Posted on: 2007-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: S X Song
Full Text: PDF
GTID: 2178360212985406
Subject: Software engineering
Abstract/Summary:
Text clustering techniques group large-scale text collections into clusters with high internal similarity, which makes it easier and faster to browse and find relevant information. Unlike structured data mining, text clustering works on semi-structured or unstructured text objects whose feature spaces are sparse. Given this characteristic, we look for semantic correlation between text objects at different stages of the clustering process and use that correlation to improve the clustering results.

In the preprocessing of text objects, existing document representations typically follow the bag-of-words model, in which single words and word stems serve as the features that represent document content. To strengthen the discriminative features of text objects, we propose a novel document representation model, the Concept CHain Model (CCHM). During clustering, the hierarchical clustering algorithm operates on different levels of the concept chains.

In defining the similarity of text objects, existing text clustering methods use symmetric similarity or dissimilarity to measure the correlation between documents. We propose a novel approach that uses asymmetric similarity for text clustering. Because the asymmetric similarity matrix is sparse, we perform the clustering analysis on the strong components of the sparse matrix. Our approach, the TCUAP algorithm, provides a conceptual structure after the hierarchical clustering.

In the clustering process itself, we construct a semantic correlation network (SCN) by analyzing the distribution of asymmetric similarity. We conjecture that the connection distribution follows a power law, which means hub points may exist in the semantic correlation network. The SCN agglomerative hierarchical clustering approach groups these hub points first; the proximity of hub points is defined using both object similarity and neighbor similarity. Finally, we assign the remaining text objects to their nearest hub points. Furthermore, we improve the ROCK algorithm by using the asymmetric correlation of neighbors. The IROCK (Improved ROCK) algorithm performs clustering analysis based on the overlap information of asymmetric proximities between text objects, and the clustering proceeds in an agglomerative hierarchical way.
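
The idea of clustering by the strong components of a sparse asymmetric similarity matrix can be illustrated with a minimal sketch. This is not the TCUAP algorithm itself, whose details the abstract does not give. It assumes a containment-style measure (the fraction of one document's terms that also occur in the other) as the asymmetric similarity, a hypothetical threshold of 0.75 to make the matrix sparse, and Kosaraju's algorithm to find strongly connected components. All function names are illustrative.

```python
from collections import defaultdict


def asymmetric_similarity(a, b):
    """Assumed measure: fraction of a's terms that also occur in b.

    In general sim(a -> b) != sim(b -> a), which is what makes it asymmetric.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def build_graph(docs, threshold):
    """Keep a directed edge i -> j only when sim(i -> j) >= threshold."""
    graph = defaultdict(list)
    for i, a in enumerate(docs):
        for j, b in enumerate(docs):
            if i != j and asymmetric_similarity(a, b) >= threshold:
                graph[i].append(j)
    return graph


def strong_components(n, graph):
    """Kosaraju's algorithm; returns a sorted list of sorted node lists."""
    order, seen = [], set()

    def visit(u):
        # First pass: record nodes in order of DFS completion.
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                visit(v)
        order.append(u)

    for u in range(n):
        if u not in seen:
            visit(u)

    # Second pass runs on the reversed graph.
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    components, assigned = [], set()
    for u in reversed(order):
        if u in assigned:
            continue
        stack, comp = [u], []
        assigned.add(u)
        while stack:
            x = stack.pop()
            comp.append(x)
            for y in reverse.get(x, []):
                if y not in assigned:
                    assigned.add(y)
                    stack.append(y)
        components.append(sorted(comp))
    return sorted(components)


# Toy documents as term lists (made-up data, for illustration only).
docs = [
    ["text", "clustering", "similarity"],
    ["text", "clustering", "similarity", "matrix"],
    ["graph", "hub", "network"],
    ["graph", "hub", "network", "power"],
    ["text", "graph"],
]
clusters = strong_components(len(docs), build_graph(docs, threshold=0.75))
print(clusters)
```

Documents that are strongly similar in both directions end up in the same component, while a document with no edge above the threshold stays a singleton. A hierarchical procedure could then merge such components further, but how the thesis does this is not stated here.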
Keywords/Search Tags:Text Clustering, Data Mining, Unsupervised Learning, Asymmetric Similarity, Semantic Correlation