Research On Short Text Topic Model Based On Word Network And Word Vectors

Posted on: 2019-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Tang
Full Text: PDF
GTID: 2428330545985132
Subject: Computer technology
Abstract/Summary:
Topic models are an effective method for text data mining, with applications including text mining and personalized recommendation. With the rapid development of the Internet in recent years, the volume of short text data has grown rapidly, and it is essential to organize and summarize these data automatically. Conventional models such as pLSA and LDA are designed for long texts. In short text scenarios, these models suffer from the sparsity problem caused by the small number of words per document. Recent studies such as BTM and WNTM show that using word co-occurrence pairs is effective in relieving the sparsity problem. However, both BTM and WNTM ignore the semantic information between words. To address this gap, this paper proposes a model named SEREIN, which constructs a word network on the corpus and uses word vectors to learn semantic similarity. The work in this paper consists of the following parts:
(1) To deal with the sparsity problem in short texts, a simple and effective short text topic model is proposed. Unlike existing models, SEREIN alleviates the sparsity problem by constructing pseudo documents with word vectors. Experimental results show that SEREIN improves substantially on existing models;
(2) To address the limitation of BTM and WNTM, SEREIN incorporates semantic representations into the pseudo document construction procedure. This includes quantifying the co-occurrence relationship with similarity, discovering semantically similar but non-co-occurring words through arithmetic relationships between word vectors, and adding words with high semantic similarity as calculated from word vectors. Experimental results show that introducing semantic information meaningfully improves the performance of the topic model;
(3) Drawing on evaluation methods for word vectors, SEREIN is further improved by using higher-quality word vectors. At the same time, the effectiveness of Word2Vec and the flexibility of SEREIN are confirmed;...
Keywords/Search Tags: topic model, short text, word network, word vectors
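
The pseudo document construction described in the abstract can be illustrated with a minimal sketch. This is not the SEREIN implementation: the function name `build_pseudo_documents`, the parameters `scale` and `sim_threshold`, the repetition-based weighting, and the toy corpus and vectors are all assumptions made for illustration. The sketch builds a word co-occurrence network, weights each co-occurring neighbour by cosine similarity, and adds words that never co-occur with the target word but are close to it in vector space. The vector-arithmetic step mentioned in part (2) is not covered.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_pseudo_documents(docs, vectors, scale=3, sim_threshold=0.9):
    """Build one pseudo document per word (hypothetical helper, not from the thesis).

    docs          -- list of tokenized short texts
    vectors       -- dict mapping word -> embedding (e.g. from Word2Vec)
    scale         -- assumed factor turning similarity into a repeat count
    sim_threshold -- assumed cutoff for adding non-co-occurring similar words
    """
    # Step 1: word network, linking words that appear in the same short text.
    neighbours = defaultdict(set)
    for doc in docs:
        for w1, w2 in combinations(sorted(set(doc)), 2):
            neighbours[w1].add(w2)
            neighbours[w2].add(w1)

    vocab = sorted(neighbours)
    pseudo = {}
    for w in vocab:
        words = []
        # Step 2: weight each co-occurring neighbour by semantic similarity.
        for n in sorted(neighbours[w]):
            sim = cosine(vectors[w], vectors[n]) if w in vectors and n in vectors else 0.0
            repeats = max(1, round(scale * max(sim, 0.0)))
            words.extend([n] * repeats)
        # Step 3: add semantically similar words that never co-occur with w.
        for other in vocab:
            if other == w or other in neighbours[w]:
                continue
            if w in vectors and other in vectors:
                if cosine(vectors[w], vectors[other]) >= sim_threshold:
                    words.append(other)
        pseudo[w] = words
    return pseudo


# Toy corpus and made-up 2-dimensional vectors, for illustration only.
docs = [["apple", "fruit"], ["banana", "fruit"], ["car", "road"]]
vectors = {
    "apple": [1.0, 0.0],
    "banana": [0.9, 0.1],
    "fruit": [0.8, 0.2],
    "car": [0.0, 1.0],
    "road": [0.1, 0.9],
}
pseudo = build_pseudo_documents(docs, vectors)
for word, words in pseudo.items():
    print(word, words)
```

In this toy corpus, "apple" and "banana" never share a text, yet "banana" enters the pseudo document for "apple" because their vectors are close. The resulting pseudo documents could then be passed to a standard topic model such as LDA, as WNTM does with its word-network documents.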