Text Clustering Research Based On The RI Method

Posted on: 2016-11-18
Degree: Master
Type: Thesis
Country: China
Candidate: J J Wu
Full Text: PDF
GTID: 2308330470963930
Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the volume and variety of text data on the Internet keep growing. Large amounts of redundant, non-standard, content-rich text make information filtering, browsing, and querying difficult. Text clustering plays an important role in data mining. Its core is to find a text representation that uncovers the latent semantic information between texts and reduces dimensionality quickly, and to combine that representation with an efficient clustering algorithm in an unsupervised setting. However, current text clustering technology is still imperfect and its practical application is limited, so research on text clustering is of great significance.

This paper first introduces the technologies related to text clustering, mainly word segmentation, feature word extraction, text representation, and text clustering algorithms. Focusing on the two key modules of text clustering, namely text representation and the clustering algorithm, it proposes a text clustering algorithm based on the RI (Random Indexing) method. The main research content is as follows:

1. In text representation, semantic clustering based on the LSA and PLSA models does not reflect latent semantic clustering features well, and clustering accuracy is unsatisfactory because of the high dimensionality of the feature word text vectors. This paper therefore studies the RI method and combines it with feature weights to generate random index vectors for the feature words in the text representation. RI both captures the semantic characteristics between feature words and reduces dimensionality. However, because constructing the context vector of a feature word involves many vector additions, RI has a problem of semantic resolution between feature words in the text representation. This paper therefore improves both the random placement of the vector elements in the feature context vectors constructed by RI and the calculation of relative feature weights, so as to better reflect the latent semantic clustering between feature words and meet the needs of text clustering.

2. In clustering, the AGNES text clustering algorithm has difficulty choosing merge points. On top of the RI text representation, this paper develops an improved K-Means+AGNES text clustering algorithm to obtain a better clustering result. The algorithm has two steps. First, it generates the best initial cluster number and the corresponding clusters for AGNES: the K-Means algorithm is improved so that, starting from a suitable range for the initial cluster number, an FMC-based algorithm adjusts that number to obtain the best cluster number, after which the initial cluster centers and clusters are generated. Second, it takes these best initial clusters as the initial merge points of the hierarchical clustering algorithm AGNES, which continues clustering until the final cluster number is reached.

3. To validate the proposed RI-based text representation and the improved K-Means+AGNES clustering algorithm, this paper tests the algorithms and compares and analyzes the results. The tests and comparative analysis show that the RI method has better text representation ability and that the improved K-Means+AGNES clustering algorithm based on RI achieves a better text clustering effect.

Finally, this paper briefly summarizes the research work, analyzes some of its shortcomings, and looks ahead to directions for future work.
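
The pipeline outlined in points 1 and 2 can be illustrated with a minimal sketch. It shows only the baseline ideas: standard Random Indexing (sparse random index vectors summed into context vectors), plain K-Means to produce initial clusters, and AGNES-style merging of those clusters. It does not reproduce the thesis's improvements to element placement and feature weighting, nor the FMC-based adjustment of the cluster number. All function names, the parameters `dim`, `nonzeros`, and `window`, the use of term frequency as the feature weight, and centroid distance as the merge criterion are assumptions made for illustration.

```python
import math
import random


def index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector: a few +1/-1 entries at random positions."""
    vec = [0.0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzeros)):
        vec[pos] = 1.0 if k % 2 == 0 else -1.0
    return vec


def ri_document_vectors(docs, dim=64, nonzeros=4, window=2, seed=0):
    """Baseline Random Indexing. Context vector of a word = sum of the index
    vectors of its neighbours; document vector = sum of its words' context vectors."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: index_vector(dim, nonzeros, rng) for w in vocab}
    context = {w: [0.0] * dim for w in vocab}
    for doc in docs:
        for i, w in enumerate(doc):
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if j != i:
                    for d in range(dim):
                        context[w][d] += index[doc[j]][d]
    vectors = []
    for doc in docs:
        vec = [0.0] * dim
        for w in doc:  # assumed weighting: each occurrence counts once (term frequency)
            for d in range(dim):
                vec[d] += context[w][d]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])  # unit length, so distance tracks direction
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means; returns clusters as lists of point indices (empty ones dropped)."""
    rng = random.Random(seed)
    centers = [points[i] for i in rng.sample(range(len(points)), k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(i)
        centers = [mean([points[i] for i in g]) if g else centers[c]
                   for c, g in enumerate(groups)]
    return [g for g in groups if g]


def agnes_merge(points, clusters, target):
    """AGNES-style step: merge the two clusters with the closest centroids
    until only `target` clusters remain (centroid linkage is an assumption)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        cents = [mean([points[i] for i in c]) for c in clusters]
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        a, b = min(pairs, key=lambda ab: sq_dist(cents[ab[0]], cents[ab[1]]))
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters


# Toy corpus, already segmented into feature words (hypothetical data).
docs = [
    "text clustering groups similar documents".split(),
    "clustering algorithm groups text documents".split(),
    "hierarchical clustering merges similar clusters".split(),
    "football team wins league match".split(),
    "league match ends football season".split(),
    "team wins final match".split(),
]
vectors = ri_document_vectors(docs)
initial = kmeans(vectors, k=4)                    # over-segment with K-Means
final = agnes_merge(vectors, initial, target=2)   # merge down to the final number
print(final)
```

Starting AGNES from K-Means clusters rather than from single documents reduces the number of merge decisions, which is the motivation the abstract gives for combining the two algorithms. In this sketch the initial `k` is fixed by hand, whereas the thesis selects it with its FMC-based adjustment.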
Keywords/Search Tags: text clustering, text representation, RI method, K-Means, AGNES