Research On The Key Techniques Of Chinese Text Clustering

Posted on:2016-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:M L Shao

Full Text:PDF

GTID:2308330464970013

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology, the volume of textual information is growing explosively. How to extract the valuable information hidden in this mass of text has become a major research subject. Text clustering, as a major method of text mining, has received much attention from scholars both in China and abroad.

This thesis first reviews the state of research on the key techniques of text clustering. It then introduces the key techniques used in text clustering analysis, including text preprocessing, text feature extraction, text modeling, text similarity computation, and clustering algorithms. Among these, text similarity computation is the most crucial factor in text retrieval, and the efficiency of the clustering algorithm directly influences the final clustering result. Consequently, this thesis focuses on two key techniques: text similarity computation and fuzzy clustering.

Building on a study of the theory behind the Latent Dirichlet Allocation (LDA) topic model and word co-occurrence, the thesis proposes a text similarity algorithm based on the LDA topic model and word co-occurrence analysis. The algorithm introduces into the LDA topic model a semantic similarity measure for topic feature words that is derived from word co-occurrence analysis. Experimental results show that this similarity method improves text clustering precision, recall, and related measures.

The classical Lumer-Faieta (LF) algorithm lacks a rigorous mathematical basis, and the probabilities with which an ant picks up or puts down a data object are set somewhat arbitrarily from prior knowledge. To address these defects, this thesis proposes a fuzzy clustering algorithm, named GAFCM, that integrates granular computing, the ant colony algorithm, and fuzzy theory. The idea of fuzzy granular computing is introduced into the LF algorithm, and whether an ant picks up or puts down an object is decided by a similarity membership function. The Fuzzy C-Means (FCM) algorithm is easily influenced by the initial cluster centers and is sensitive to outliers. To overcome this deficiency, the improved ant colony algorithm first performs an initial clustering of the text data, and the resulting cluster centers are then used as the initial centers for FCM. Simulation experiments show that the algorithm has better overall performance and a better clustering effect.
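The abstract's last step, running FCM from cluster centers supplied by an earlier stage, can be illustrated with a minimal sketch. This is standard Fuzzy C-Means only, not the thesis's GAFCM: the ant colony stage is not reproduced, and the function name `fcm`, the parameters `m`, `n_iter` and `tol`, and the `init_centers` argument that stands in for the ant colony output are all assumptions made for illustration.

```python
import numpy as np


def memberships(X, centers, m):
    """Fuzzy membership matrix U (n_points x n_clusters); each row sums to 1."""
    # Euclidean distance from every point to every center.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.maximum(dist, 1e-12)  # guard against division by zero
    # Standard FCM rule: u_ij is proportional to d_ij ** (-2 / (m - 1)).
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fcm(X, init_centers, m=2.0, n_iter=100, tol=1e-5):
    """Standard Fuzzy C-Means started from caller-supplied centers.

    init_centers is a hypothetical stand-in for the centers that an
    initial clustering stage (e.g. an ant colony pass) would provide.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        U = memberships(X, centers, m)
        # Center update: mean of the points weighted by u_ij ** m.
        W = U ** m
        new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    # Recompute memberships so they match the returned centers.
    return centers, memberships(X, centers, m)
```

Called as `centers, U = fcm(X, init_centers)`, the function returns the refined centers and the fuzzy membership matrix; `U.argmax(axis=1)` gives a hard assignment if one is needed. In a text clustering setting the rows of `X` would be document feature vectors, and a distance other than the Euclidean one used here may be more appropriate.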

Keywords/Search Tags:

fuzzy clustering, Latent Dirichlet Allocation (LDA), word co-occurrence, similarity computing, granular computing, ant colony algorithm, fuzzy C-means (FCM) algorithm

Related items

1	The Research And Application Of Parallel Latent Dirichlet Allocation And Clustering Algorithm
2	Study On The Clustering Ensemble Algorithm Based On Granular Computing
3	The Application Of Granular Computing In Clustering Analysis
4	The Application And Research Of Granular Computing In Hierarchical Fuzzy Control Based On Fuzzy Set
5	Fuzzy C-means And K-means Clustering Algorithm And Its Parallel
6	Research On Clustering Algorithm Based On Particle Computing
7	Research On Fuzzy Control Algorithm Based On Granular Function
8	Scene Classification Algorithm Based On Markov Random Field And Fuzzy Set Theory
9	Research On Text Classification Method Based On FCM Clustering
10	The Research On Fuzzy C-means Documents Clustering Based On Ant Colony Optimization