Document Topic Clustering Analysis Based On Improved K-means Method

Posted on: 2021-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: W Chen
GTID: 2428330626961133
Subject: Applied statistics
Abstract/Summary:
Since the beginning of the 21st century, the rapid development of the Internet industry has produced a wide variety of chat software and social networking platforms. As more and more text documents appear on the Internet, information processing technologies have emerged to help users quickly extract the information they need from massive document collections. Topic extraction from text documents is a typical example. However, traditional topic extraction requires the number of topics to be known in advance, whereas a document collection obtained in practice is likely to be a disordered data set whose content and number of topics are both unknown, so traditional topic extraction is difficult to apply to real situations. Building on traditional text document clustering, this paper makes corresponding improvements to the clustering method and the index parameters, with the aim of quickly and conveniently identifying the number of topics in, and extracting the topics from, a disordered document data set with unknown content and an unknown number of topics. First, the original document data are converted to numerical form, so that everyday text data become a sparse matrix. Next, the MDS dimension reduction method is applied to this sparse matrix to extract its relevant features and reduce the waste of computational space. Then, an improved k-means clustering method is introduced that applies density clustering to the choice of initial cluster centers, and the VCVI clustering evaluation index is checked against the optimal number of clusters to see whether it can recover the optimal number of topics when that number is unknown, thereby testing its applicability in this paper. Once the optimal number of topics is obtained, LDA is applied to the resulting documents of each class to extract the topics of each class, and the final results are verified.
Keywords/Search Tags: Improved k-means clustering, MDS dimension reduction, LDA model, VCVI criterion
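
The abstract does not spell out how density is used to choose the initial cluster centers, so the following is only a minimal sketch of one common reading of that idea, not the thesis's algorithm. It assumes that "density" means the number of neighbours within a fixed radius, that seeds must be more than that radius apart, and that plain Lloyd iterations follow. The names `density_seeds`, `kmeans`, and `radius` are hypothetical, and the 2-D points merely stand in for MDS-reduced document vectors.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def density_seeds(points, k, radius):
    """Pick up to k seeds: high-density points more than `radius` apart.

    Assumption: density = number of other points within `radius`.
    """
    density = [sum(1 for q in points if dist(p, q) <= radius) - 1 for p in points]
    order = sorted(range(len(points)), key=lambda i: -density[i])
    seeds = []
    for i in order:
        if all(dist(points[i], s) > radius for s in seeds):
            seeds.append(points[i])
        if len(seeds) == k:
            break
    return seeds  # may hold fewer than k seeds if the data are too compact


def kmeans(points, centers, iters=50):
    """Plain Lloyd iterations starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy data: three small groups, each a center point plus four neighbours.
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
offsets = [(-0.2, 0.0), (0.2, 0.0), (0.0, -0.2), (0.0, 0.2), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

seeds = density_seeds(points, k=3, radius=0.25)
labels, centers = kmeans(points, seeds)
print(seeds)
print(labels)
```

Seeding from dense regions avoids the sensitivity of standard k-means to randomly chosen starting points. In a fuller pipeline the number of clusters `k` would itself be selected by a validity index such as the VCVI criterion named above, which this sketch does not implement.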