
Research On Embedded Topic Model Construction And Topic Analysis Based On BERT

Posted on: 2022-12-26    Degree: Master    Type: Thesis
Country: China    Candidate: Y H Wang    Full Text: PDF
GTID: 2518306779475764    Subject: Library Science and Digital Library
Abstract/Summary:
With the development and maturation of deep learning, the Latent Dirichlet Allocation (LDA) family of topic models remains widely used in text mining, but it represents text with a bag-of-words model, so the semantic and syntactic relationships between words cannot be expressed and the generated topics are poorly interpretable. The Embedded Topic Model (ETM) uses static word vectors to reflect the relationships between words, but it assigns a polysemous word the same vector in every context and therefore cannot resolve polysemy. In addition, ETM performs variational inference with a Variational Auto-Encoder (VAE) and generates the topic distribution by approximating the posterior of the model's latent variables; because the VAE tends to ignore the latent variables during inference, they end up only weakly correlated with the input, and the document-topic distributions the model learns are insufficiently comprehensive. To address these problems, this thesis proposes an embedded text topic model based on BERT. The specific improvements to the existing topic models are as follows:

a. To address the poor interpretability of topics generated by traditional topic models, this thesis uses the Embedded Topic Model (ETM) for topic mining. ETM adds word-vector representations to the bag-of-words representation used in LDA, compensating for the limited information the bag-of-words carries, supplementing the contextual semantic relationships between words, enriching the text features, and fitting more interpretable topics. On the 20 Newsgroups English dataset, ETM achieves a topic coherence of 0.183 and a topic diversity of 0.780; on the Weibo Chinese dataset, a topic coherence of 0.125 and a topic diversity of 0.824.

b. To address ETM's inability to resolve polysemy, a BERT-based embedded topic model (BERT-ETM) is proposed. It obtains word embeddings that fully incorporate contextual features, resolving the polysemy problem, and mines high-quality, fine-grained topic-word representations for documents. Experiments show that BERT's dynamic word vectors effectively represent the meanings of polysemous words. On the 20 Newsgroups English dataset, BERT-ETM achieves a topic coherence of 0.198 and a topic diversity of 0.910; on the Weibo Chinese dataset, a topic coherence of 0.137 and a topic diversity of 0.837, an improvement over ETM. Moreover, WoBERT-ETM, which is based on Chinese word segmentation, achieves a topic coherence of 0.172 and a topic diversity of 0.908 on the Weibo Chinese dataset, improvements of 0.035 and 0.071 over BERT-ETM, indicating that combining the BERT model with Chinese word segmentation yields finer-grained topic-word representations when processing Chinese corpora.
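As a minimal sketch of the idea behind ETM and BERT-ETM described in points a and b: word embeddings (here taken from BERT) and topic embeddings live in the same space, and the topic-word matrix is a softmax over their inner products. The abstract does not give the exact architecture, so the model name bert-base-uncased, the toy vocabulary, and the randomly initialized topic embeddings and document-topic proportions below are illustrative assumptions, not the thesis's implementation.

```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

# Illustrative vocabulary and sizes (assumptions, not from the thesis).
vocab = ["software", "requirement", "test", "design", "model"]
num_topics, emb_dim = 10, 768

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

# Step 1: a BERT embedding for every vocabulary word (here the mean of its
# token vectors; BERT-ETM would use embeddings of words in context).
with torch.no_grad():
    rho = []
    for word in vocab:
        ids = tokenizer(word, return_tensors="pt")
        hidden = bert(**ids).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]
        rho.append(hidden.mean(dim=0))
    rho = torch.stack(rho)                               # (V, 768) word embeddings

# Step 2: ETM-style decoder: beta_k = softmax(rho @ alpha_k) over the vocabulary.
alpha = torch.randn(num_topics, emb_dim, requires_grad=True)  # topic embeddings
beta = F.softmax(rho @ alpha.t(), dim=0).t()                  # (K, V) topic-word distributions

# Step 3: a document's word distribution mixes the topics with the
# proportions theta_d inferred by the variational encoder.
theta_d = F.softmax(torch.randn(num_topics), dim=0)
p_w_given_d = theta_d @ beta                                  # (V,) reconstruction term
```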
c. To address the VAE's neglect of the latent variables, the network structure of ETM is improved: the Information Maximizing Variational Auto-Encoder (InfoVAE) is used in place of the standard VAE, so that the latent variables are fully exploited during variational inference and a more comprehensive topic-word representation is obtained. On the 20 Newsgroups English dataset, the InfoVAE-based BERT-ETM achieves a topic coherence of 0.245 and a topic diversity of 0.932, improvements of 0.047 and 0.022 over BERT-ETM.

d. This study constructs a dataset from the professional-field textbook Software Engineering and compares the proposed model with other models on it. The experimental results again show gains in topic coherence and topic diversity. At the same time, the Software Engineering textbook is taken as the object of topic analysis, in the hope of identifying the key and difficult points of the discipline and helping teachers and students understand the course content.
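Point c replaces the standard VAE objective with InfoVAE. The abstract does not state the exact objective used; a widely used instantiation of InfoVAE augments the usual reconstruction and KL terms with a Maximum Mean Discrepancy (MMD) penalty between samples of the aggregated posterior and the prior, which keeps the latent code informative about the input. The kernel bandwidth and the weight lambda_mmd below are illustrative assumptions, not values from the thesis.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two batches of latent samples."""
    dists = torch.cdist(x, y) ** 2
    return torch.exp(-dists / (2 * sigma ** 2))

def mmd(z_posterior, z_prior, sigma=1.0):
    """Biased MMD estimate between q(z) samples and p(z) samples."""
    k_pp = rbf_kernel(z_prior, z_prior, sigma).mean()
    k_qq = rbf_kernel(z_posterior, z_posterior, sigma).mean()
    k_pq = rbf_kernel(z_posterior, z_prior, sigma).mean()
    return k_pp + k_qq - 2 * k_pq

def infovae_loss(recon_nll, kl, z_post, lambda_mmd=10.0):
    """Illustrative InfoVAE-style loss: reconstruction + KL + weighted MMD."""
    z_prior = torch.randn_like(z_post)  # samples from the N(0, I) prior
    return recon_nll + kl + lambda_mmd * mmd(z_post, z_prior)
```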
Keywords/Search Tags:Topic model, Topic analysis, BERT model, Information maximization, Polysemy