
Research On Topic Modeling Method Based On Semantic Distribution Similarity

Posted on: 2020-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Y Ju
Full Text: PDF
GTID: 2428330578479412
Subject: Software engineering
Abstract/Summary:
Latent Dirichlet Allocation (LDA) is a popular probabilistic topic model that can extract latent semantic information from large collections of unstructured documents. Through topic modeling, LDA can not only discover semantically related topics from the word-level representation of documents, but also obtain a representation of each document in a low-dimensional semantic topic feature space. However, the traditional topic model is built on the "Bag of Words" (BOW) assumption, which treats the importance of a word in the corpus as depending only on its frequency. This simplifies modeling, but it ignores the semantic information and semantic associations present in the documents, leaving the topic model with problems such as poor interpretability, insufficient semantic coherence, slow convergence, and weak expressiveness of the topic features.

To address these problems, this paper integrates the semantic information of words, the semantic associations between words and between words and their documents, and the semantic information of topic features into topic modeling, in order to improve the quality of the latent topics from a semantic perspective. Specifically, the main contributions are as follows:

(1) A topic model based on dynamic weights (DWTM) is proposed. It aims to suppress the influence of word frequency on modeling and to prevent topics from gravitating toward high-frequency words. The main idea is that the semantic distribution of a high-frequency word is more similar to the noise word vector. During the iterative process, high-frequency words are therefore given smaller weights to reduce their contribution to modeling, while keywords and low-frequency words are given larger weights. Experimental results show that DWTM obtains more interpretable topics than the traditional topic model: PMI values on four public datasets increased by 2.0% to 7.4% on average, and text-classification accuracy increased by 1.01% to 5.44% on average.

(2) A topic model
based on word semantic distribution similarity (WSDSTM) is proposed. It achieves semantic enhancement of the topic model by considering, during modeling, the semantic associations between words and between words and their documents. The main ideas are that words with similar semantics should have similar probabilistic topic distributions, and that the stronger the semantic association between a sampled word and its document, the greater the probability of the corresponding topic being sampled for that document. The model uses the Generalized Polya Urn (GPU) scheme to incorporate word-word and document-topic semantic enhancement matrices that guide modeling. Experimental results show that WSDSTM further weakens the influence of word frequency on modeling at the level of semantic association; compared with DWTM, it achieves more semantically coherent topics, and the representation ability of the topic features is further improved.

(3) A topic optimization method based on the semantic distribution similarity of topic features (TFOM) is proposed. The method optimizes the topics by judging whether they are "good" or "bad" from the perspective of semantic information. First, it measures topic importance using the method from Topic Significance Ranking (TSR). Second, it uses document category information to measure the semantic distribution of the topic features. Finally, it calculates the similarity between each topic's features and the noise topic. Experimental results show that TFOM effectively distinguishes topics and enhances the representation ability of topic features; compared with the traditional topic model, text-classification accuracy on the four public datasets increased by 2.99% to 9.29% on average.
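The dynamic-weight idea in (1) can be illustrated with a small sketch. This is a toy stand-in, not the thesis's exact formula: it assumes the "noise word vector" can be approximated by the mean word embedding (which high-frequency words tend to resemble), and assigns each word a weight of one minus its cosine similarity to that noise vector, so noise-like high-frequency words contribute less to modeling.

```python
import numpy as np

def dynamic_weights(embeddings):
    """Toy dynamic word weights: words whose vectors resemble the
    assumed noise vector (the mean embedding) get small weights."""
    words = list(embeddings)
    E = np.array([embeddings[w] for w in words], dtype=float)
    noise = E.mean(axis=0)                        # assumed noise vector
    norms = np.linalg.norm(E, axis=1) * np.linalg.norm(noise)
    sims = E @ noise / np.maximum(norms, 1e-12)   # cosine similarity
    weights = np.clip(1.0 - sims, 1e-3, None)     # noise-like -> small weight
    return dict(zip(words, weights))
```

In a Gibbs-sampling implementation, such weights would scale each word's contribution to the topic-word counts instead of counting every occurrence as 1.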
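The Generalized Polya Urn scheme used in (2) can be shown as a minimal count update: assigning a word to a topic returns not only that word to the urn but also fractions of semantically related words. The promotion matrix here is a hypothetical word-word semantic enhancement matrix (e.g., thresholded embedding similarities), not the one learned in the thesis.

```python
import numpy as np

def gpu_assign(topic_word_counts, promotion, topic, word):
    """One Generalized Polya Urn (GPU) update.

    topic_word_counts : (K, V) array of topic-word counts
    promotion         : (V, V) matrix A with A[w, w] = 1 and
                        A[w2, w] > 0 when w2 is semantically close to w

    Assigning `word` to `topic` also promotes related words,
    pulling semantically similar words into the same topic.
    """
    topic_word_counts[topic] += promotion[:, word]
    return topic_word_counts
```

In a full sampler, the matching decrement would be applied before re-sampling a token's topic assignment.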
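The noise-topic comparison in (3) can be sketched in the spirit of Topic Significance Ranking: a topic spread evenly over the vocabulary looks like noise, while a peaked, well-formed topic is far from the uniform "junk" distribution. This simplified score uses only KL divergence from the uniform topic and omits the document-category signal described above.

```python
import numpy as np

def distance_from_noise_topic(topic_word_dists):
    """Score each topic by KL(p || uniform): near 0 for a noise-like
    topic spread evenly over the vocabulary, larger for peaked topics."""
    p = np.clip(np.asarray(topic_word_dists, dtype=float), 1e-12, None)
    p = p / p.sum(axis=-1, keepdims=True)
    V = p.shape[-1]
    # KL(p || uniform) = sum_w p_w * log(p_w * V)
    return np.sum(p * np.log(p * V), axis=-1)
```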
Keywords/Search Tags: Topic Model, Latent Dirichlet Allocation, Dynamic Weight, Semantic Distribution Similarity, Topic Optimization