Research On Text Spectral Clustering Algorithm Based On Hidden Topics

Posted on: 2019-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: X B Qiu
Full Text: PDF
GTID: 2438330566473400
Subject: Computer Science and Technology
Abstract/Summary:
Text clustering groups unlabeled text data automatically, without any prior knowledge; it is an unsupervised method. Spectral clustering (SC), which is based on spectral theory, is generally regarded as a highly efficient algorithm. It maps a dataset onto an undirected weighted graph, which turns the division of data into categories into a graph partitioning problem. Compared with K-means and other common algorithms, spectral clustering can cluster sample spaces of arbitrary shape and can converge to a globally optimal solution, and it transforms a relatively complex clustering problem into a relatively simple algebraic one. Spectral clustering also has shortcomings, the most common being the construction of the similarity matrix and the need to determine the number of clusters in advance.

The effectiveness of spectral clustering depends on the similarity matrix. Traditional text similarity calculation uses the vector space model, which is based on feature word vectors and suffers from high-dimensional sparsity and a lack of semantic information. To address text similarity calculation, this thesis uses part of speech and word weights to select the feature words that best reflect the content of a text, which reduces the number of feature words. It then introduces the latent topic information of the Latent Dirichlet Allocation (LDA) model and weights the feature-word similarity and the topic similarity to calculate the similarity between texts. To address the need to determine the number of clusters in advance, the thesis builds on the NJW algorithm and uses the eigengap method, which examines the distances between the eigenvalues of the Laplacian matrix, to obtain the number of text clusters.

The thesis proposes an adaptive feature weighting NJW algorithm (AFW-NJW). The algorithm makes full use of lexical features and topic features to calculate text similarity, and uses the eigengap method to determine the number of clusters for spectral clustering. Because the LDA model requires the number of topics to be set manually, the thesis uses the average similarity between topics to determine the optimal number of topics. Experiments verify that the adaptive LDA model determines the number of topics automatically and that AFW-NJW determines the number of clusters automatically; they are also used to determine the weight of the topic features in the text similarity calculation. The proposed AFW-NJW algorithm is compared with the traditional K-means and NJW algorithms, and the results show that it achieves a higher normalized mutual information (NMI) score than both.
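
The pipeline summarized in the abstract (blend lexical and topic similarity, estimate the number of clusters from the eigengap, then run NJW) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the linear blend controlled by `topic_weight`, the cosine similarity, the function names, and the toy data are all assumptions made for this example, and the document-topic vectors stand in for the output of an LDA model.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    return Xn @ Xn.T

def combined_similarity(word_vecs, topic_vecs, topic_weight=0.5):
    """Blend lexical and topic similarity (assumed linear weighting)."""
    S = ((1.0 - topic_weight) * cosine_similarity_matrix(word_vecs)
         + topic_weight * cosine_similarity_matrix(topic_vecs))
    np.fill_diagonal(S, 0.0)  # NJW uses a zero diagonal in the affinity matrix
    return S

def njw_spectrum(S):
    """Eigenpairs of D^{-1/2} S D^{-1/2}, sorted by descending eigenvalue."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def estimate_k_by_eigengap(eigvals, max_k=10):
    """Pick k at the largest gap between consecutive eigenvalues."""
    max_k = min(max_k, len(eigvals) - 1)
    gaps = eigvals[:max_k] - eigvals[1:max_k + 1]
    return int(np.argmax(gaps)) + 1

def kmeans(points, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def njw_cluster(S, max_k=10):
    """NJW clustering with the cluster count chosen by the eigengap."""
    vals, vecs = njw_spectrum(S)
    k = estimate_k_by_eigengap(vals, max_k)
    U = vecs[:, :k]                                    # top-k eigenvectors
    row_norms = np.linalg.norm(U, axis=1, keepdims=True)
    Y = U / np.where(row_norms == 0, 1.0, row_norms)   # row-normalize
    return k, kmeans(Y, k)

# Toy data (hypothetical): 9 documents in 3 groups with disjoint vocabularies.
word_vecs = np.array([
    [2, 1, 0, 0, 0, 0], [1, 2, 0, 0, 0, 0], [3, 1, 0, 0, 0, 0],
    [0, 0, 1, 3, 0, 0], [0, 0, 2, 2, 0, 0], [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 4, 1], [0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 2, 3],
], dtype=float)
topic_vecs = np.repeat(np.eye(3), 3, axis=0)  # stand-in for LDA document-topic vectors

k, labels = njw_cluster(combined_similarity(word_vecs, topic_vecs, topic_weight=0.5))
print(k, labels)
```

On real text the similarity matrix is not block-diagonal, so the eigengap is less clear-cut than on this toy data, and the value of `topic_weight` would have to be chosen experimentally, as the abstract indicates.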
Keywords/Search Tags: Text clustering, spectral clustering, topic model, feature weighting