
Research On Topic Model Over Short Texts With Incorporation Of Word Embedding

Posted on: 2018-07-14
Degree: Master
Type: Thesis
Country: China
Candidate: K Yu
Full Text: PDF
GTID: 2428330623450897
Subject: Engineering
Abstract/Summary:
Short texts are one of the main forms of user-generated content (UGC) on the internet, and they carry tremendous value because of their huge volume and extremely rapid growth. However, the data sparsity of short texts makes it difficult to find document-level word co-occurrence patterns, which is why conventional topic models such as LDA suffer a large performance degradation on short texts. Alleviating this sparsity and inferring the latent topics of short texts with the help of auxiliary information has therefore become a hot spot in text mining. Word embedding, a derivative of neural probabilistic language models, forms the basis of many NLP tasks; it expresses the semantic similarity of words well in a vector space, and this property has made using word embeddings to assist short-text mining an active research direction. In this thesis we propose two strategies for incorporating word embeddings into a topic model and apply them to the Biterm Topic Model (BTM) to help reveal the hidden topics of short texts.

The first strategy is to increase the probability that semantically similar words belong to the same topic, based on the empirical observation that semantically or grammatically similar words are likely to occur under the same topic. During Gibbs sampling of the model parameters, whenever the current word is assigned to a topic, we also promote the counts of its similar words, obtained from the word embedding space, under that topic, thereby raising the probability that semantically similar words share a topic. Moreover, in the generative process of BTM's biterms, the assumption that both words of a biterm belong to the same topic is too strong. We therefore separate words into topical words and general words, judged by a word's probability distribution over topics, and only the counts of a topical word's semantically similar words are promoted. Based on this strategy we propose the promotion-BTM model. Experiments on three real-world datasets show that it exceeds the baseline BTM on all evaluations: topic coherence, document classification, and document clustering.
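The following is a minimal, self-contained sketch of this count-promotion step, written to illustrate the idea rather than to reproduce the thesis' implementation; the toy vocabulary, the random embedding matrix, and the helper names and parameters (most_similar, is_topical, weight, peak_threshold) are assumptions made for the example.

```python
import numpy as np

# Toy vocabulary and "pre-trained" word embeddings; in practice the embeddings
# would come from word2vec or GloVe trained on a large corpus.  All sizes,
# names and thresholds below are illustrative assumptions.
V, K, D = 6, 2, 4                       # vocabulary size, number of topics, embedding dim
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(V, D))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

n_kw = np.zeros((K, V))                 # topic-word counts maintained by the Gibbs sampler

def most_similar(w, top_n=2):
    """Indices of the top_n words closest to word w by cosine similarity."""
    sims = embeddings @ embeddings[w]
    sims[w] = -np.inf                   # exclude the word itself
    return np.argsort(sims)[-top_n:]

def is_topical(w, peak_threshold=0.5):
    """Treat w as a 'topical' word if its normalized count distribution over
    topics is sufficiently peaked; otherwise treat it as a general word."""
    counts = n_kw[:, w]
    total = counts.sum()
    return total > 0 and (counts / total).max() >= peak_threshold

def promote(w, k, weight=0.3):
    """After topic k has been sampled for word w, also add a fractional
    pseudo-count for w's embedding neighbours under k, but only when w
    itself looks like a topical word."""
    n_kw[k, w] += 1.0                   # the ordinary count update
    if is_topical(w):
        for u in most_similar(w):
            n_kw[k, u] += weight        # promotion of semantically similar words

# One illustrative update: pretend the Gibbs sampler just assigned word 0 to topic 1.
promote(w=0, k=1)
print(n_kw)
```

In this sketch the similar words receive only a fractional weight, so that promotion nudges the topic-word statistics toward grouping similar words without overwhelming the counts that come from the data itself.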
The second strategy is to constrain the distribution of topical words by clustering words in the word vector space. Since the words highly relevant to a topic are usually only a small part of the vocabulary, drawing a word from the entire vocabulary after a topic has been chosen is too unconstrained. We use the k-means algorithm to cluster the word vectors into a number of clusters, and topical words are restricted to these clusters. Viewed from another angle, this adds a constraint layer between topics and words, replacing the topic's multinomial distribution over words with two distributions: the topic's multinomial over constraints and each constraint's multinomial over words. The generative process of a document word then consists of three steps: first a topic is selected, then a constraint is drawn from the topic, and finally a word is drawn from the constraint. Because similar words lie close to each other in the vector space, they are highly likely to fall into the same cluster, so constraining topical words to clusters again raises the probability that similar words belong to the same topic. Applying this strategy to BTM, we propose the constraint-BTM model. It shows better performance than the baseline BTM on multiple evaluations, which demonstrates the validity of this strategy.
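As with the first strategy, the sketch below is only schematic: the sizes, names, and hyperparameters are assumptions, and the distributions theta (topic over constraints) and phi (constraint over words) are drawn at random here purely to show the clustering step and the three-step generative structure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch of the constraint layer; all sizes, names and
# hyperparameters are assumptions made for this example.
V, K, C, D = 20, 3, 4, 8        # vocabulary, topics, constraints (clusters), embedding dim
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(V, D))

# Cluster the word vectors with k-means; each cluster acts as one "constraint",
# so a topic can only reach words through the clusters it puts mass on.
cluster_of_word = KMeans(n_clusters=C, n_init=10, random_state=0).fit_predict(embeddings)
words_in_cluster = [np.flatnonzero(cluster_of_word == c) for c in range(C)]

# theta[k] : topic k's multinomial over constraints
# phi[c]   : constraint c's multinomial over the words inside cluster c
theta = rng.dirichlet(np.ones(C), size=K)
phi = [rng.dirichlet(np.ones(len(ws))) for ws in words_in_cluster]

def generate_word(k):
    """Three-step generative process: topic -> constraint -> word."""
    c = rng.choice(C, p=theta[k])                   # draw a constraint from the topic
    w = rng.choice(words_in_cluster[c], p=phi[c])   # draw a word within that constraint
    return c, w

for _ in range(5):
    print(generate_word(k=0))
```

In the full constraint-BTM these distributions are inferred from the corpus rather than drawn at random; the sketch only illustrates how the constraint layer narrows word selection to the cluster a topic has chosen.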
Keywords/Search Tags: Short texts, Topic model, Word embedding, Gibbs sampling, Text mining