Research On The Key Techniques Of Chinese Text Clustering

Posted on: 2016-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: M L Shao
GTID: 2308330464970013
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of textual information is growing explosively. How to extract the valuable information hidden in this mass of text has become a major research subject. Text clustering, as a major method of text mining, has received much attention from scholars at home and abroad.

This thesis first reviews the research status of the key techniques of text clustering at home and abroad. It then introduces the key techniques used in text clustering analysis, including text preprocessing, text feature extraction, text modeling, text similarity computing, and clustering algorithms. Among them, text similarity computing is the most crucial factor in text retrieval, and the efficiency of the clustering algorithm has a direct influence on the final clustering effect. Consequently, this thesis focuses on two key techniques: text similarity computing and fuzzy clustering.

Building on a study of the theory of the Latent Dirichlet Allocation (LDA) topic model and of word co-occurrence, the thesis proposes a text similarity computing algorithm based on the LDA topic model and word co-occurrence analysis. The algorithm introduces into the LDA topic model a semantic similarity measure for topic feature words that is based on word co-occurrence analysis. Experimental results show that this similarity computing method improves text clustering metrics such as precision and recall.

The classical Lumer-Faieta (LF) algorithm lacks a rigorous mathematical basis, and it sets the probabilities with which an ant picks up or puts down a data object arbitrarily, based on prior knowledge. To address these defects, this thesis proposes a fuzzy clustering algorithm, named GAFCM, that integrates granular computing, the ant colony algorithm, and fuzzy theory. The idea of fuzzy granular computing is introduced into the LF algorithm, so that whether an ant picks up or puts down an object is decided by a similarity membership function. The Fuzzy C-Means (FCM) algorithm is easily influenced by the initial cluster centers and is sensitive to outliers. To overcome this deficiency, the improved ant colony algorithm first completes an initial clustering of the text data, and the resulting cluster centers are then used as the initial centers of the FCM algorithm. Simulation experiments show that this algorithm has better overall performance and a better clustering effect.
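The second stage described above, running FCM from cluster centers supplied by an earlier clustering pass, can be sketched as follows. This is a minimal illustration of the standard FCM update rules, not the thesis's GAFCM implementation: the abstract does not give the details of the improved ant colony stage, so the `init_centers` argument simply stands in for whatever centers that stage would produce. The function names, the fuzzifier `m = 2.0`, the Euclidean distance, and the toy data are all assumptions made for this example.

```python
import numpy as np


def memberships(X, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (rows sum to 1)."""
    # Euclidean distance from each point to each center, shape (n, c).
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.fmax(dist, 1e-12)  # guard against division by zero
    # u[i, k] is proportional to dist[i, k] ** (-2 / (m - 1)).
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_c_means(X, init_centers, m=2.0, max_iter=100, tol=1e-5):
    """Standard FCM that starts from caller-supplied initial centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        um = memberships(X, centers, m) ** m
        # Each new center is the mean of all points, weighted by u ** m.
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(X, centers, m)


# Toy usage: two well-separated groups of 2-D points. The initial centers
# are placeholders for the output of a first-stage clustering pass.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centers, u = fuzzy_c_means(X, init_centers=[[1, 1], [9, 9]])
labels = u.argmax(axis=1)  # hard assignment from the fuzzy memberships
```

Passing the initial centers in explicitly, rather than drawing them at random inside the function, mirrors the two-stage design in the abstract: the quality of the final clustering then depends on how good the first-stage centers are.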
Keywords: fuzzy clustering, Latent Dirichlet Allocation (LDA), word co-occurrence, similarity computing, granular computing, ant colony algorithm, fuzzy C-means (FCM) algorithm