Research And Application Of Text Clustering Based On Topic Model

Posted on:2021-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:T J Li

GTID:2427330620463217

Subject:Applied Statistics

Abstract/Summary:

Since the beginning of the new century, Internet and new media technologies have developed rapidly, and society has entered an era of massive data. In the information age, a variety of new Internet media platforms have emerged, such as Twitter, Weibo and Toutiao. The medium carrying text information has gradually shifted from printed newspapers to online digital media, and online text has become a major form of communication in modern society. In recent years, the number of web texts (news, blogs, etc.) has grown explosively, producing a huge amount of semi-structured and unstructured text data. How to extract valuable information from the massive text data generated by these platforms is a hot issue in the field of text mining.

Based on the topic model, this thesis studies the clustering of web text. The approach addresses the weakness of the traditional vector space model (VSM) in capturing the latent semantic relationships within text, and alleviates the high-dimensionality and sparsity problems that the traditional VSM causes in text clustering. The text similarity computed by the LDA topic model is combined with the text similarity computed by a VSM based on TF-IDF feature extraction, so that clustering draws on both feature words and topic information, with the two similarities weighted by a feature proportion coefficient. This not only greatly improves the quality of the clustering results but also keeps their stability at a relatively high level.

The study also found two shortcomings. On the one hand, the LDA model has certain defects in distinguishing topics that are ambiguous or whose keywords express fuzzy concepts; on the other hand, the bag-of-words model ignores the order of words in a document. To address these two shortcomings, and building on previous studies, this thesis proposes a text clustering algorithm based on the word vector model and the LDA topic model. Document-topic information is mapped into the Word2vec space, a threshold is set on the semantic similarity between topic keywords, and text is clustered by combining topic granularity with word granularity. The semantic similarity information and the contextual word-order information captured by the LDA model and the Word2vec model are both used effectively, and the strengths and weaknesses of the two text representation models are considered together, so as to improve the effect of text clustering.

To test the effectiveness of the proposed method, news content was crawled from the Toutiao news website. On text data sets covering six news categories, the proposed clustering method showed significant improvements in precision, recall and F-measure. Finally, using t-SNE dimensionality reduction, this thesis visualizes the topic words of the text corpus, effectively mines the key points under each topic, and confirms that the topic words under each topic have a high degree of semantic similarity.
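The combination of topic-based and TF-IDF-based similarity described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the exact formula, so a linear weighting with a coefficient `lam` is assumed, cosine similarity is assumed for both components, and the document-topic distributions (`theta_a`, `theta_b`) are assumed to come from an LDA model fitted elsewhere. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Smoothed IDF (an assumption; other variants are equally plausible).
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_dense(p, q):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def combined_similarity(tfidf_a, tfidf_b, theta_a, theta_b, lam=0.5):
    """Assumed linear mix: lam * topic similarity + (1 - lam) * TF-IDF similarity."""
    topic_sim = cosine_dense(theta_a, theta_b)
    word_sim = cosine_sparse(tfidf_a, tfidf_b)
    return lam * topic_sim + (1.0 - lam) * word_sim


# Toy corpus; the topic distributions are made-up placeholders, not LDA output.
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["football", "match", "result"],
]
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
vecs = tfidf_vectors(docs)

sim_01 = combined_similarity(vecs[0], vecs[1], thetas[0], thetas[1], lam=0.5)
sim_02 = combined_similarity(vecs[0], vecs[2], thetas[0], thetas[2], lam=0.5)
print(sim_01, sim_02)
```

The resulting pairwise similarities could then be passed to any similarity-based clustering algorithm. Setting `lam` to 1 uses only topic information and setting it to 0 uses only feature words; how the thesis actually chooses the coefficient is not stated in the abstract.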

Keywords/Search Tags:

Text clustering, Topic model, Word2vec, Textual similarity, T-SNE

Related items

1	Variance Analysis In Teaching Evaluation Themes Of Students With Different Majors Based On LDA Model
2	Penalized Matrix Decomposition And Its Application In Text Topic Clustering
3	Research Of Text Representation Method Based On Co-occurrence Analysis
4	Research On Tag-Topic Identification And Community Mining In Social Network
5	A Study On The Recruitment Market Of Data Analysis Based On Text Mining
6	News Recommendation Method Based On Improved Similarity And User Clustering
7	A Research On Topic Evolution Based On LDA And Word2vec
8	Research On The Course Recommendation Based On Word2Vec And TF-IDF
9	Research On Scientific Document Clustering And Topic Evolution Based On Citation Networks
10	Study On Accurate Recommendation Of E-commerce Graduates For Employment