Semantic Correlation Based Text Clustering Approaches

Posted on:2007-09-25

Degree:Master

Type:Thesis

Country:China

Candidate:S X Song

GTID:2178360212985406

Subject:Software engineering

Abstract/Summary:

Text clustering techniques group large-scale text collections into clusters with high internal similarity, which makes relevant information easier and faster to browse and find. Unlike structured data mining, text clustering operates on semi-structured or unstructured text objects with sparse data spaces. Given this characteristic, we look for the semantic correlation among text objects at different stages of the clustering process and use it to improve the clustering results.

In the preprocessing of text objects, existing document representation systems typically use the bag-of-words model, in which single words and word stems serve as the features representing document content. To strengthen the discriminative features of text objects, we propose a novel document representation model, the Concept CHain Model (CCHM). During clustering, the hierarchical clustering algorithm operates at different levels of the concept chains.

In defining the similarity of text objects, existing text clustering methods use symmetric similarity or dissimilarity to measure the correlation between documents. We propose a novel approach that uses asymmetric similarity for text clustering. Exploiting the sparseness of the asymmetric similarity matrix, we perform the clustering analysis on the strong components of the sparse matrix. Our approach, the TCUAP algorithm, provides a conceptual structure after the hierarchical clustering.

In the clustering process, we construct a semantic correlation network (SCN) by analyzing the distribution of asymmetric similarity. We conjecture that the connection distribution follows a power law, which means hub points may exist in the semantic correlation network. The SCN agglomerative hierarchical clustering approach groups these hub points first; the definition of hub point proximity considers both object similarity and neighbor similarity. Finally, we assign the remaining text objects to their nearest hub points.

Furthermore, we improve the ROCK algorithm by using the asymmetric correlation of neighbors. The IROCK (Improved ROCK) algorithm performs clustering analysis based on the overlap information of asymmetric proximities between text objects, and the clustering proceeds in an agglomerative hierarchical way.
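The idea of clustering on the strong components of a sparse asymmetric similarity matrix can be illustrated with a small sketch. This is not the TCUAP algorithm itself, whose details the abstract does not give. The similarity measure (the fraction of one document's terms that also occur in the other), the threshold of 0.6, the toy documents, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

def asymmetric_similarity(a, b):
    """Fraction of a's terms that also occur in b (assumed measure, not the thesis's)."""
    return len(a & b) / len(a) if a else 0.0

def build_graph(docs, threshold):
    """Add a directed edge i -> j whenever sim(i -> j) >= threshold."""
    graph = defaultdict(list)
    for i, a in enumerate(docs):
        for j, b in enumerate(docs):
            if i != j and asymmetric_similarity(a, b) >= threshold:
                graph[i].append(j)
    return graph

def strong_components(n, graph):
    """Kosaraju's algorithm (iterative); returns sorted strongly connected components."""
    # Pass 1: record nodes in order of DFS finishing time.
    visited = [False] * n
    order = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()
    # Build the reversed graph.
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    # Pass 2: explore the reversed graph in decreasing finishing time.
    assigned = [False] * n
    components = []
    for start in reversed(order):
        if assigned[start]:
            continue
        assigned[start] = True
        component, todo = [], [start]
        while todo:
            node = todo.pop()
            component.append(node)
            for prev in reverse.get(node, ()):
                if not assigned[prev]:
                    assigned[prev] = True
                    todo.append(prev)
        components.append(sorted(component))
    return sorted(components)

# Toy documents represented as term sets (hypothetical data).
docs = [
    {"text", "cluster", "similarity"},
    {"text", "cluster", "similarity", "matrix"},
    {"text", "cluster"},
    {"graph", "hub", "network"},
    {"graph", "hub", "network", "power"},
]
clusters = strong_components(len(docs), build_graph(docs, threshold=0.6))
print(clusters)
```

In this sketch the similarity is directional: a short document is fully covered by a longer one that contains all of its terms, while the reverse coverage is only partial. Thresholding such a measure yields a sparse directed graph, and documents that reach each other along directed edges fall into the same strong component.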

Keywords/Search Tags:

Text Clustering, Data Mining, Unsupervised Learning, Asymmetric Similarity, Semantic Correlation

Related items

1	Research On Semantic Similarity Of Text Based On Unsupervised Contrastive Learning
2	Study On Similarity-based Text Clustering Algorithm And Its Application
3	Research On Text Clustering Based On Semantic Similarity
4	The Research And Application Of Unsupervised And Supervised Short Text Similarity Measure
5	Research And Implementation Of The Text Cluster Based On Text Similarity Caculation
6	Research On Thesis Text Clustering Based On Semantic Similarity
7	Research Of Feature Vector Value Weighted Based On Semantic Analysis In Chinese Text Clustering
8	The Study Of Measures And Applications Of Short Text Semantic Similarity
9	Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier
10	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity