A Study Of Short Text Topic Models Based On Information Of Word Embeddings

Posted on: 2019-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: R Feng
Full Text: PDF
GTID: 2428330566984151
Subject: Software engineering
Abstract/Summary:
Classical topic models rely on word co-occurrence patterns and can distill high-quality topics from long text collections. In short text collections, however, co-occurrence patterns are sparse, so classical topic models often fail to distill semantically coherent topics. Word embeddings trained on a large corpus encode general semantic and syntactic information about words, so they can serve as supplementary knowledge that compensates for sparse co-occurrence patterns and guides topic modeling for short text collections. However, because word embeddings are usually trained on a large external corpus, the encoded information is not necessarily suitable for the dataset on which the topic model is trained, a point that existing methods often neglect. In this work, the ECTM model is proposed on the basis of the GPU-DMM model; it leverages both word embeddings and local information for topic modeling. ECTM distills semantic similarity information between words from word embeddings and filters this information using pointwise mutual information (PMI) learned from the training collection. During parameter inference in ECTM, the sampler further leverages the semantic similarity information between words to enhance topic coherence. ECTM nevertheless has several hyper-parameters that need to be tuned, and its simple assumption that each short document contains only one topic may be limiting in some circumstances; both restrict the usability of the model. The IECTM model is therefore proposed on the basis of ECTM. IECTM reduces the number of hyper-parameters that need to be tuned and relaxes the restriction that each short document contains only one topic: in IECTM, each short document can contain more than one topic. Because documents are short, each one is forced to focus on a few topics, and the number of topics in each short document is determined by its content, which is realized with a spike-and-slab prior. To validate the effectiveness of ECTM and IECTM, experiments are conducted on several real-world short text collections. The experimental results indicate that ECTM and IECTM can distill semantically coherent topics in most cases, which demonstrates the practical usability of these models.
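
The filtering idea in the abstract (keep a word pair only when the embeddings say the words are similar and the training collection itself supports the pairing) can be illustrated with a minimal sketch. The abstract does not give the exact rule ECTM uses, so everything here is an assumption made for illustration: the function names `cosine`, `document_pmi`, and `related_pairs`, the document-level definition of PMI, the two thresholds, and the toy embeddings and documents.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def document_pmi(docs):
    """PMI from document-level co-occurrence: log(p(a, b) / (p(a) * p(b))).

    Assumption: probabilities are estimated as document frequencies divided
    by the number of documents. Only pairs that co-occur at least once get
    an entry; keys are alphabetically ordered (a, b) tuples.
    """
    n = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    return {
        (a, b): math.log(count * n / (word_df[a] * word_df[b]))
        for (a, b), count in pair_df.items()
    }


def related_pairs(embeddings, docs, sim_threshold=0.5, pmi_threshold=0.0):
    """Keep pairs that are similar in embedding space AND have PMI above the
    threshold in the local collection. Both thresholds are hypothetical."""
    pmi = document_pmi(docs)
    kept = []
    for a, b in combinations(sorted(embeddings), 2):
        if cosine(embeddings[a], embeddings[b]) < sim_threshold:
            continue  # not similar according to the external embeddings
        if pmi.get((a, b), float("-inf")) > pmi_threshold:
            kept.append((a, b))  # similarity confirmed by local statistics
    return kept


# Toy data (invented): "apple" is close to both "fruit" and "phone" in the
# embedding space, but in this collection it only co-occurs with "phone".
embeddings = {
    "apple": [1.0, 0.1],
    "fruit": [0.9, 0.2],
    "phone": [0.8, 0.3],
    "river": [0.0, 1.0],
}
docs = [
    ["apple", "phone"],
    ["apple", "phone"],
    ["fruit", "river"],
    ["fruit", "river"],
]
pairs = related_pairs(embeddings, docs)
```

On this toy collection only the pair ("apple", "phone") survives: "apple" and "fruit" are close in the embedding space but never co-occur locally, while "fruit" and "river" co-occur but are not similar according to the embeddings.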
Keywords/Search Tags: Topic Model, Short Text Collections, Word Embedding, Local Context Information