Study On Similarity-based Text Clustering Algorithm And Its Application

Posted on:2011-01-26

Degree:Master

Type:Thesis

Country:China

Candidate:S Q Ma

GTID:2178360302993977

Subject:Computer application technology

Abstract/Summary:

Text clustering is an important branch of text mining and has been studied in increasing depth because of its distinctive role in knowledge discovery. Many efficient text clustering algorithms are now widely used in automatic document organization, the organization of search results, and digital library services. However, as document sets grow, traditional text clustering algorithms run into difficulties that are hard to overcome: for instance, they ignore the semantic correlation between words, and their results are unstable. This thesis studies text clustering with these problems in mind.

The thesis first reviews the background of text mining and analyzes the need for text clustering and the current state of research on it in China and abroad. It then introduces the traditional text clustering algorithms and compares and analyzes them. The emphasis is on document representation and the DBSCAN algorithm, for which improved algorithms are proposed, and a text clustering system is designed on the basis of this work. The main work of the thesis is as follows:

(1) The traditional text clustering algorithms are introduced and compared in terms of scalability, multi-dimensionality, the handling of high-dimensional data, and other criteria.

(2) To improve document representation, the thesis presents a Chinese text clustering algorithm using a semantic list (CTCAUSL). The algorithm first uses semantic similarity between words to compute text similarity, which captures the semantic relevance between texts. It then uses the synonyms and near-synonyms recorded in the semantic list to remove redundant words, which reduces the dimensionality of the text representation. Finally, a partitioning clustering algorithm is applied. Experiments show that CTCAUSL improves the accuracy of the clustering results.

(3) A text density clustering algorithm with optimized threshold values (TDCAOTV) is proposed to address the loss of clustering performance that DBSCAN suffers when it uses global threshold values. The proposed algorithm sorts the objects by their k-neighbor distance, uses quantiles to separate the sorted sequence into groups of different density, finds the corresponding optimized threshold values, and then clusters the objects with a density clustering algorithm based on those thresholds. The improved algorithm overcomes the loss of clustering performance caused by threshold selection and improves clustering accuracy and efficiency. Clusters are stored in a tree structure, which makes them easier to read. The experimental results show the effectiveness of the algorithm.

(4) Building on the theoretical study, the algorithms presented in this thesis are applied to a text collection, and a text clustering system is designed that provides a preprocessing module, a semantic list module, a text clustering module, and a result evaluation module. An analysis of the main functions of each module in the system architecture, and of the system's application, shows that it has good extensibility and flexibility.
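The threshold-selection idea in item (3) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the equal-sized quantile groups, the choice of the largest k-neighbor distance in each group as that group's threshold, and the use of Euclidean distance on 2-D points (in place of a text similarity measure) are all assumptions made for illustration.

```python
import math


def sorted_k_distances(points, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)


def eps_per_density_level(points, k, levels):
    """Split the sorted k-distances into `levels` quantile groups and
    return one candidate neighborhood radius (eps) per group.

    Assumption: the eps for a group is the largest k-distance in it.
    Requires 1 <= levels <= len(points) and 1 <= k < len(points).
    """
    k_dists = sorted_k_distances(points, k)
    n = len(k_dists)
    eps_values = []
    for level in range(1, levels + 1):
        end = round(level * n / levels)  # index just past this quantile group
        eps_values.append(k_dists[end - 1])
    return eps_values


# Hypothetical data: one dense group (spacing 1) and one sparse group (spacing 10).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (50, 50), (50, 60), (60, 50), (60, 60)]
print(eps_per_density_level(points, k=2, levels=2))  # → [1.0, 10.0]
```

A single global threshold would have to sit at either 1 or 10 for this data, so it would either split the sparse group into noise or be needlessly loose for the dense one. Running a density clustering pass per threshold, from the smallest upward, avoids that trade-off.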

Keywords/Search Tags:

text mining, text clustering, text representation, semantic list, similarity calculation, cluster representation, DBSCAN algorithm, TDCAOTV algorithm, quantile

Related items

1	Research On Text Clustering Based On Semantic Similarity
2	Research On Key Techniques Of Short-text Representation And Classification Based On Hybrid Semantic
3	Text Clustering Based On Center Point Selection And Deep Representation Learning
4	Research On Text Clustering Algorithm Based On DBSCAN
5	Research And Application Of Text Similarity Calculation Method Based On Structured Representation Learning
6	Research And Implementation Of The Text Cluster Based On Text Similarity Calculation
7	Research On Text Representation And Feature Extraction Methods Based On Conditional Co-occurrence Degree
8	The Research On Local Smooth Preserving Of Manifold Regularization Auto Encoder For Text Representation
9	The Research On Chinese Sentential Semantic Model Parsing And Text Representation
10	A Research On Text Analysis And Representation Based On Semantic Information