
Research And Application Of Text Clustering Based On Topic Model

Posted on: 2021-01-15  Degree: Master  Type: Thesis
Country: China  Candidate: T J Li  Full Text: PDF
GTID: 2427330620463217  Subject: Applied Statistics
Abstract/Summary:
Since the turn of the century, Internet and new-media technologies have developed rapidly, and society has entered an era of massive data. A variety of new Internet media platforms have emerged, such as Twitter, Weibo and Toutiao (Headlines). The carrier of text information has gradually shifted from printed newspapers to online digital media, and online text has become a major form of information communication in modern society. In recent years, the volume of web text (news, blogs, etc.) has grown explosively, producing a huge amount of semi-structured and unstructured text data. How to extract valuable information from the massive text data generated by these platforms is a central problem in text mining.

Based on the topic model, this thesis studies the clustering of web text. The approach addresses the weakness of the traditional vector space model (VSM) in capturing latent semantic relationships within text, and alleviates the high-dimensionality and sparsity problems that the VSM causes in text clustering. The text similarity computed from the LDA topic model is combined with the text similarity computed from a VSM with tf-idf feature extraction, so that clustering draws on both feature words and topic information, with the two similarities weighted by a proportion coefficient. This markedly improves the quality of the clustering results while keeping their stability at a relatively high level.

The study also found two shortcomings. First, the LDA model has difficulty distinguishing topics whose concepts or topic keywords are ambiguous. Second, the bag-of-words model ignores the order of words within a document. To address both, and building on previous studies, this thesis proposes a text clustering algorithm based on the word vector model and the LDA topic model. Document-topic information is mapped into the Word2vec space, a threshold is set on the semantic similarity between topic keywords, and clustering combines topic-level and word-level granularity. In this way the semantic similarity information of the LDA model and the contextual word-order information of the Word2vec model are both used, and the strengths and weaknesses of the two text representations are weighed together, so as to improve the clustering result.

To test the proposed method, news content was crawled from the Toutiao news website to build a text data set covering six news categories. On this data set, the proposed method shows significantly improved precision, recall and F-measure. Finally, using t-SNE dimensionality reduction, the thesis visualizes the topic words of the corpus, mines the key points under each topic, and ensures that the topic words under each topic have a high degree of semantic similarity.
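
The weighted combination of tf-idf similarity and LDA topic similarity mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the linear blend below, the weight name `lam`, the function names, and the toy vectors are all assumptions made for illustration, not the thesis's actual method or data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(tfidf_a, tfidf_b, topics_a, topics_b, lam=0.5):
    """Assumed linear blend of the two similarities.

    lam weights the tf-idf (VSM) similarity and (1 - lam) weights the
    similarity of the documents' topic distributions. The abstract only
    says a "proportion coefficient" is used; this form is a guess.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    sim_vsm = cosine(tfidf_a, tfidf_b)
    sim_topic = cosine(topics_a, topics_b)
    return lam * sim_vsm + (1.0 - lam) * sim_topic


# Toy example (made-up numbers): the two documents share no feature words,
# so their tf-idf vectors are orthogonal, but their topic mixtures are close.
doc_a_tfidf = [0.0, 0.7, 0.0, 0.3]
doc_b_tfidf = [0.5, 0.0, 0.5, 0.0]
doc_a_topics = [0.8, 0.1, 0.1]
doc_b_topics = [0.7, 0.2, 0.1]

print(combined_similarity(doc_a_tfidf, doc_b_tfidf, doc_a_topics, doc_b_topics, lam=0.5))
```

In this toy case the tf-idf similarity alone is zero, while the blended score is not, which is the kind of latent relationship the topic component is meant to capture. The resulting pairwise similarities could then be passed to any similarity-based clustering algorithm.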
Keywords/Search Tags: Text clustering, Topic model, Word2vec, Textual similarity, t-SNE