
Topic Modeling For Short Texts With Auxiliary Word Embeddings

Posted on: 2018-10-05  Degree: Master  Type: Thesis
Country: China  Candidate: H R Wang  Full Text: PDF
GTID: 2348330515989690  Subject: Computer software and theory
Abstract/Summary:
For many applications that require semantic understanding of short texts, inferring discriminative and coherent latent topics is a critical and fundamental task. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, because each document is short, short texts are much sparser in terms of word co-occurrences. Data sparsity therefore becomes a bottleneck that prevents conventional topic models from achieving good results on short texts. On the other hand, when a human being interprets a piece of short text, the understanding is based not only on its content words but also on background knowledge (e.g., semantically related words). Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Exploiting such auxiliary word embeddings to enrich topic modeling for short texts is the main focus of this paper.

To this end, we propose a simple, fast, and effective topic model for short texts, named GPU-DMM. Based on the Dirichlet Multinomial Mixture model, GPU-DMM promotes semantically related words under the same topic during the sampling process by using the generalized Polya urn model. In this way, background knowledge about word semantic relatedness, learned from millions of external documents, can easily be exploited to improve topic modeling for short texts.

Through extensive experiments on two real-world short text collections in two languages, we show that GPU-DMM achieves comparable or better topic representations than state-of-the-art models, as measured by topic coherence. Given the output of the topic model, each document can be represented by its topic distribution, i.e., as a vector; we feed these vectors to a standard classifier such as a Support Vector Machine. The topic representation learned by GPU-DMM leads to the best accuracy in a text classification task, which serves as an indirect evaluation. Last, we evaluate efficiency by comparing the time cost per iteration of each model; the results demonstrate that GPU-DMM achieves comparable efficiency.
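The generalized Polya urn step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `RELATED` similarity table is a hypothetical stand-in for word relatedness learned from external word embeddings, and the promotion weight `mu` is an assumed hyperparameter. The key idea is that assigning a word to a topic also adds fractional counts for its semantically related words under that topic.

```python
import numpy as np

# Hypothetical similarity table standing in for relatedness scores learned
# from auxiliary word embeddings (e.g., cosine similarity of word vectors).
RELATED = {
    "apple": {"fruit": 0.8},
    "fruit": {"apple": 0.8},
}

def gpu_promote(topic_word_counts, vocab_index, word, topic, mu=0.3):
    """Generalized Polya urn update: assigning `word` to `topic` also adds
    a fraction `mu * similarity` of a count for each related word, so that
    semantically related words are promoted under the same topic."""
    topic_word_counts[topic, vocab_index[word]] += 1.0
    for related_word, sim in RELATED.get(word, {}).items():
        if related_word in vocab_index:
            topic_word_counts[topic, vocab_index[related_word]] += mu * sim

vocab_index = {"apple": 0, "fruit": 1, "car": 2}
counts = np.zeros((2, 3))  # 2 topics x 3 vocabulary words
gpu_promote(counts, vocab_index, "apple", topic=0)
# "fruit" now carries a fractional count under topic 0 as well,
# even though only "apple" was observed in the sampled document.
```

Inside a Gibbs sampler, this update would replace the plain count increment, so that sparse short texts still accumulate evidence for related words they never mention.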
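The indirect classification evaluation can likewise be sketched. The topic vectors below are made-up examples, not real GPU-DMM output: each document is represented by its inferred distribution over K topics and fed to an off-the-shelf SVM, as the abstract describes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical document-topic distributions inferred by a topic model:
# each row is one document's probability vector over K = 3 topics.
X_train = np.array([
    [0.90, 0.05, 0.05],  # documents dominated by topic 0 -> class "tech"
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],  # documents dominated by topic 1 -> class "sport"
    [0.10, 0.80, 0.10],
])
y_train = ["tech", "tech", "sport", "sport"]

# A linear SVM over the topic proportions, used as the standard classifier.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

print(clf.predict([[0.85, 0.10, 0.05]])[0])  # prints "tech"
```

Classification accuracy on held-out documents then measures, indirectly, how discriminative the learned topic representations are.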
Keywords/Search Tags: Topic Model, Short Texts, Word Embeddings