Research and Application of Cluster Analysis in Text Mining

Posted on: 2017-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: H Sheng
Full Text: PDF
GTID: 2348330488482711
Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the Web 2.0 era, text information on the Internet has grown explosively, and organizing the information people need takes more and more time and effort. How to find useful information in this mass of noisy text promptly and accurately has become an urgent problem. In this context, using text clustering to filter and automatically archive large volumes of text, and to extract the main textual features from it, can greatly reduce the manual workload of organizing documents and improve retrieval efficiency, so it has far-reaching significance and broad application prospects. This thesis studies the clustering by fast search and find of density peaks algorithm (CFSFDP) and proposes an improved version based on potential entropy (PE-CFSFDP). On this basis, it proposes an algorithm that fuses K-means with the improved density peak algorithm. Experiments on UCI data sets and the Sogou text corpus verify that the improved text clustering algorithm has good stability and accuracy. The details are as follows:

First, CFSFDP is a density-based clustering algorithm with several shortcomings: the cut-off distance used to compute the local density must be set manually; clustering quality is poor on small data sets; one misassigned sample can trigger a chain of further assignment errors; and samples assigned to different clusters can overlap.
To define the local density automatically, the thesis applies the concept of potential entropy from data fields to optimize the local density measure (PE-CFSFDP). The cut-off distance is determined objectively from a combined index of potential energy and entropy, so the local density is computed more reasonably and the clustering rests on a sounder basis.

Second, K-means picks k random points as initial cluster centers, which makes its clustering results unstable. The thesis therefore presents K-CFSFDP, a fusion of PE-CFSFDP and K-means. PE-CFSFDP is used to initialize the cluster centers and to select the value of k automatically. This compensates for three weaknesses of K-means: the number of clusters must be given in advance, the result is sensitive to the choice of initial centers, and the algorithm can fall into local minima. Experiments on UCI data sets and artificial data sets show that the fusion algorithm obtains better clustering results and is very stable.

Third, using the Sogou text corpus, the text is preprocessed by Chinese word segmentation, stop-word removal, and feature extraction. Feature words are then weighted according to their degree of influence on identification, a vector space model (VSM) is built, and the K-CFSFDP fusion algorithm is applied for text clustering. The clustering results are compared and analyzed using precision and F-measure. The experiments show that the improved algorithm greatly improves clustering accuracy and stability in text mining applications.
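The fusion idea can be sketched as ordinary K-means iterations started from centers supplied by a density peak step rather than from random points. This is a minimal sketch under stated assumptions, not the thesis's K-CFSFDP code: the function name `kmeans_from_seeds` is hypothetical, Euclidean distance is assumed, and the seeds are passed in directly, so how PE-CFSFDP would produce them is not shown.

```python
import math

def kmeans_from_seeds(points, seeds, max_iter=100):
    """Lloyd-style K-means started from given seeds (hypothetical helper)."""
    centers = [tuple(s) for s in seeds]   # k is implied by the number of seeds
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest current center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the iteration has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Illustrative data; the seeds stand in for centers found by a density peak step.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
centers, labels = kmeans_from_seeds(samples, seeds=[(0.5, 0.5), (10.5, 10.5)])
print(centers, labels)
```

With fixed seeds the run is deterministic, which is the source of the stability the abstract attributes to the fusion: repeated runs on the same data start from the same centers and therefore end at the same partition.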
Keywords/Search Tags: text clustering, K-means algorithm, clustering by fast search and find of density peaks, text mining