With the rapid development of the Internet, people now live in an era of "information explosion". Vast amounts of semi-structured and unstructured information are available, and how to mine useful information from it quickly and efficiently is a problem that many scholars are working on. Text document clustering is a method of automatic classification that does not require training. However, most current clustering algorithms fall short in both speed and accuracy.

To address this problem, we first propose a graph-based text representation model, WSCG (Weighted Subject Conceptual Graph), which divides the concepts in a document into centroid concepts and peripheral concepts based on their semantic relations to the subject. The semantic similarity between two documents is calculated over the centroid concepts and the peripheral concepts separately. Second, building on existing studies of clustering algorithms, we design a text clustering algorithm based on WSCG so that the relation between two documents is calculated more accurately during the clustering process. Finally, based on this work, a text clustering system, SemCluster, is implemented in C++.

Experiments show that the WSCG-based text representation achieves higher accuracy in document similarity calculation and clustering than existing methods. The text clustering system has also been tested, and the tests show that it meets the design requirements.