
Topic Modeling For Short Texts With Auxiliary Word Embeddings

Posted on: 2018-10-05  Degree: Master  Type: Thesis
Country: China  Candidate: H R Wang  Full Text: PDF
GTID: 2348330515989690  Subject: Computer software and theory
Abstract/Summary:
For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is short, short texts are much sparser in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck that prevents conventional topic models from achieving good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is based not only on its content words but also on background knowledge (e.g., semantically related words). Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper.

To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Polya urn model. In this way, background knowledge about word semantic relatedness, learned from millions of external documents, can easily be exploited to improve topic modeling for short texts.

Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence. Given the output of the topic model, each document can be represented by its topic distribution, i.e., as a vector; we feed these vectors to a standard classifier such as a Support Vector Machine. The topic representation learned by GPU-DMM leads to the best accuracy in a text classification task, which serves as an indirect evaluation. Last, we evaluate efficiency by comparing the time cost per iteration of each model; the results demonstrate that GPU-DMM achieves comparable efficiency.
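The generalized Polya urn step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `RELATED` similarity table is a hypothetical stand-in for word relatedness learned from external word embeddings, and the promotion weight `mu` is an assumed hyperparameter. The key idea is that assigning a word to a topic also adds fractional counts for its semantically related words under that topic.

```python
import numpy as np

# Hypothetical similarity table standing in for relatedness scores learned
# from auxiliary word embeddings (e.g., cosine similarity of word vectors).
RELATED = {
    "apple": {"fruit": 0.8},
    "fruit": {"apple": 0.8},
}

def gpu_promote(topic_word_counts, vocab_index, word, topic, mu=0.3):
    """Generalized Polya urn update: assigning `word` to `topic` also adds
    a fraction `mu * similarity` of a count for each related word, so that
    semantically related words are promoted under the same topic."""
    topic_word_counts[topic, vocab_index[word]] += 1.0
    for related_word, sim in RELATED.get(word, {}).items():
        if related_word in vocab_index:
            topic_word_counts[topic, vocab_index[related_word]] += mu * sim

vocab_index = {"apple": 0, "fruit": 1, "car": 2}
counts = np.zeros((2, 3))  # 2 topics x 3 vocabulary words
gpu_promote(counts, vocab_index, "apple", topic=0)
# "fruit" now carries a fractional count under topic 0 as well,
# even though only "apple" was observed in the sampled document.
```

Inside a Gibbs sampler, this update would replace the plain count increment, so that sparse short texts still accumulate evidence for related words they never mention.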
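The indirect classification evaluation can likewise be sketched. The topic vectors below are made-up examples, not real GPU-DMM output: each document is represented by its inferred distribution over K topics and fed to an off-the-shelf SVM, as the abstract describes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical document-topic distributions inferred by a topic model:
# each row is one document's probability vector over K = 3 topics.
X_train = np.array([
    [0.90, 0.05, 0.05],  # documents dominated by topic 0 -> class "tech"
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],  # documents dominated by topic 1 -> class "sport"
    [0.10, 0.80, 0.10],
])
y_train = ["tech", "tech", "sport", "sport"]

# A linear SVM over the topic proportions, used as the standard classifier.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print(clf.predict([[0.85, 0.10, 0.05]])[0])  # prints "tech"
```

Classification accuracy on held-out documents then measures, indirectly, how discriminative the learned topic representations are.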
Keywords/Search Tags: Topic Model, Short Texts, Word Embeddings