
Exploring Entropy-based Term Weighting Schemes In Latent Dirichlet Allocation

Posted on: 2018-02-25  Degree: Master  Type: Thesis
Country: China  Candidate: K Yang  Full Text: PDF
GTID: 2348330536978605  Subject: Engineering
Abstract/Summary:
Latent Dirichlet Allocation (LDA) is a commonly used topic model in the field of text mining. LDA and its variants have been widely used to discover latent topics in textual documents. However, some of the topics generated by LDA may be uninterpretable because they contain irrelevant words, which we call 'impurity words'. These impurity words make the LDA-generated topics harder to interpret and ultimately lower the quality of the results. One way to improve topic quality is to reduce the number of impurity words in topics, yet little work has explored what causes them. In this paper, we explore these causes. From our experimental observations, we found that certain words in the documents tend to bring impurity words into topics. These words share an obvious characteristic: they tend to scatter across many topics, so they have little ability to discriminate between different topics. We call them 'topic-indiscriminate words', and they are an important cause of impurity words in topics. We propose a new model called TWLDA, which provides a way to identify these words and reduce their impact on LDA results. First, we use entropy-based term weighting schemes to assign lower weights to topic-indiscriminate words. Second, we propose a method that reduces the effect of low-weight words on LDA during the Gibbs sampling process. This reduces the number of topic-indiscriminate words in the documents, which in turn weakens their ability to bring in impurity words and finally reduces the number of impurity words in each topic. However, TWLDA is itself a variant of the standard LDA model, so it cannot be applied to other variants of LDA. We therefore extend TWLDA into an algorithmic framework called TWFW (Term Weighting Framework), which can be applied to all variants of LDA. Our experimental results show that the proposed framework can improve the performance of LDA and its variants. Finally, we apply TWFW to a practical project and show that it works well in an engineering application.
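The abstract does not state the exact weighting formula, so the following is only a minimal sketch of one plausible entropy-based scheme, not the thesis's actual method. It assumes a topic-word count matrix is available (for example from a preliminary LDA run) and scores each word as one minus the normalized entropy of its distribution over topics, so that words scattered evenly across topics receive weights near 0. The function name `entropy_term_weights`, the matrix layout, and the normalization are all assumptions made for illustration.

```python
import numpy as np


def entropy_term_weights(topic_word_counts, eps=1e-12):
    """Score each word by how concentrated it is across topics.

    topic_word_counts: array of shape (K, V), where entry (k, v) is the
    number of times word v is assigned to topic k (hypothetical input).
    Returns an array of shape (V,) with values in [0, 1]; words spread
    evenly over topics (topic-indiscriminate) score close to 0.
    """
    counts = np.asarray(topic_word_counts, dtype=float)
    num_topics = counts.shape[0]
    if num_topics < 2:
        raise ValueError("need at least two topics to compute entropy weights")

    totals = counts.sum(axis=0, keepdims=True)
    # p(k | w): distribution of each word over topics. Words with no
    # assignments are treated as uniform, which gives them weight 0.
    p = np.where(totals > 0, counts / np.maximum(totals, eps), 1.0 / num_topics)

    # Shannon entropy per word, then normalize by the maximum log(K).
    entropy = -(p * np.log(p + eps)).sum(axis=0)
    weights = 1.0 - entropy / np.log(num_topics)
    return np.clip(weights, 0.0, 1.0)


# Illustrative counts for 2 topics and 3 words (made-up numbers).
example_counts = np.array([[10, 5, 0],
                           [0, 5, 0]])
example_weights = entropy_term_weights(example_counts)
```

In this toy input, the first word appears in only one topic and scores close to 1, while the second word is split evenly between both topics and scores close to 0. How such weights are then used inside the Gibbs sampling step is not specified in the abstract, so that part is left out of the sketch.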
Keywords/Search Tags: Latent Dirichlet Allocation, Topic Model, Term Weighting Scheme, Entropy