
Study On Similarity-based Text Clustering Algorithm And Its Application

Posted on: 2011-01-26
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Ma
GTID: 2178360302993977
Subject: Computer application technology
Abstract/Summary:
Text clustering is an important branch of text mining and has been studied in increasing depth because of its distinctive role in knowledge discovery. Many efficient text clustering algorithms are now widely used in automatic document organization, the organization of search results, and digital library services. However, as document sets grow, traditional text clustering algorithms run into serious difficulties: for instance, they ignore the semantic correlation between words, and their results are unstable. This thesis studies text clustering with these problems in mind.

The thesis first reviews the background of text mining and analyzes the need for text clustering and the current state of research on it in China and abroad. It then introduces the traditional text clustering algorithms and compares and analyzes them. The emphasis is on document representation and the DBSCAN algorithm, for which improved algorithms are proposed, and a text clustering system is designed on the basis of this work. The main contributions are as follows:

(1) The traditional text clustering algorithms are introduced and then compared and analyzed in terms of scalability, multi-dimensionality, handling of high-dimensional data, and so on.

(2) For document representation, the thesis presents a Chinese text clustering algorithm using a semantic list (CTCAUSL). The algorithm first uses semantic similarity between words to compute text similarity, which captures the semantic relevance between texts. It then uses the synonyms and near-synonyms in the semantic list to reduce word redundancy, which lowers the dimensionality of the text representation. Finally, it applies a partitioning clustering algorithm. Experiments show that the CTCAUSL algorithm improves the accuracy of the clustering results.

(3) A text density clustering algorithm with optimized threshold values (TDCAOTV) is proposed to address the loss of clustering performance that DBSCAN suffers when it uses global threshold values. The proposed algorithm sorts objects by their k-neighbor distance, uses quantiles to distinguish groups of objects with different densities, finds the corresponding optimized threshold values, and then clusters the objects with a density clustering algorithm based on those values. The improved algorithm overcomes the performance loss caused by threshold selection and improves clustering accuracy and efficiency. Clusters are stored in a tree structure, which makes them easier to read. The experimental results show the effectiveness of the algorithm.

(4) Building on this theoretical work, the algorithms presented in the thesis are applied to a text collection, and a text clustering system is designed that provides a preprocessing module, a semantic list module, a text clustering module, and a result evaluation module. An analysis of the main functions of each module, of the system architecture, and of its application shows that the system has good extensibility and flexibility.
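The threshold-selection step in (3) can be illustrated with a minimal Python sketch. The abstract does not give the exact procedure, so everything below is an assumption made for illustration: the function names (`k_distances`, `quantile`, `candidate_eps`), the use of Euclidean distance on numeric vectors, and the idea of reading candidate neighborhood radii directly off quantiles of the sorted k-neighbor distances. It is not the TDCAOTV algorithm itself.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending.

    Assumes Euclidean distance; a text-clustering system would more likely
    use a similarity-based distance between document vectors.
    """
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def candidate_eps(points, k, quantiles=(0.25, 0.5, 0.75)):
    """Hypothetical helper: one candidate radius per requested quantile."""
    kd = k_distances(points, k)
    return [quantile(kd, q) for q in quantiles]


if __name__ == "__main__":
    # Toy data: one dense group (spacing 1) and one sparse group (spacing 2).
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 12), (12, 10), (12, 12)]
    print(k_distances(pts, 1))
    print(candidate_eps(pts, 1, (0.25, 0.75)))
```

On the toy data the sorted k-neighbor distances fall into two plateaus, one per density level, which is why a single global radius suits neither group and why per-density thresholds are attractive. Each candidate radius could then be passed to a density clustering run restricted to the objects of matching density.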
Keywords/Search Tags: text mining, text clustering, text representation, semantic list, similarity calculation, cluster representation, DBSCAN algorithm, TDCAOTV algorithm, quantile