
Sparse Topic Models For Short Text

Posted on: 2021-12-01    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Q Q Xie    Full Text: PDF
GTID: 1488306461464934    Subject: Computer software and theory
Abstract/Summary:
The continuous development of social media has gradually made it a major platform for Internet users to express their opinions. Large numbers of active users publish massive quantities of short texts, such as Weibo posts and tweets, every day, and these texts reflect public opinion, social hotspots, and user interests. Semantic analysis and information extraction over such massive short text content therefore have important research significance and application value.

As unsupervised learning methods, topic models are widely used in text analysis. Without any labeling information, they can effectively learn latent low-dimensional feature representations at the document level as well as the topics expressed by texts. However, these methods are not well suited to short texts. Short texts in social media typically spread rapidly, are short in length, and contain heavy noise, which leads to sparse word co-occurrence at the document level and inevitably limits the performance of these models on short texts.

To address this problem, sparse topic models introduce a sparsity hypothesis into traditional topic models: each document is related to only a small number of topics, and each topic is related to only a small number of words. This hypothesis alleviates the co-occurrence sparseness of short texts. Previous methods based on the hypothesis fall into two major categories: 1) methods that introduce sparse priors or auxiliary variables into probabilistic topic models such as LDA to indirectly control the sparsity of the learned latent representations; 2) methods that introduce sparse regularization constraints into non-probabilistic topic models to directly control the sparsity of the latent representations. Compared with traditional probabilistic and non-probabilistic topic models, both categories can learn more semantically interpretable text representations and topics.

However, in terms of sparse modeling and model inference, these methods still face a series of challenges. 1) Incomplete sparse latent representation learning: existing sparse topic models usually learn sparse representations at only one or two of the document, topic, and word levels, and cannot effectively obtain sparse representations at all three levels. 2) Inaccurate sparse structure modeling: the sparsity mechanisms of existing sparse topic models fail to capture the correlation between sparse representations. 3) Poor model scalability: owing to their hierarchical generative processes, traditional probabilistic and non-probabilistic topic models scale poorly; whenever a variant is required, new variables must be introduced into the generative process and the inference algorithm re-derived, which seriously limits the generalization ability of the model. 4) Limited semantic representation ability: because of their shallow probabilistic or non-probabilistic generative processes, traditional topic models have weaker semantic representation ability than deep models, resulting in weak semantic coherence of the learned representations and topics.

To address these challenges, this thesis combines techniques such as sparse regularization, sparse Bayesian learning, word embeddings, neural networks, neural variational inference, and neural language models with topic modeling, and proposes a series of sparse topic models for short texts.
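Before detailing the four contributions, the following minimal sketch illustrates the regularization-based route of the second category above, which the first contribution builds on. It is an illustrative assumption rather than the formulation used in this thesis: the function name, penalty weights, grouping, and matrix shapes are hypothetical.

import numpy as np

def sparse_group_penalty(theta, lam_l1=0.1, lam_group=0.1):
    # theta: (n_docs, n_topics) non-negative document-topic codes.
    # The L1 term drives individual document-topic weights to zero, while the
    # group term (one L2 norm per document row) shrinks each document's whole
    # topic vector together; the actual grouping in the thesis may differ.
    l1_term = lam_l1 * np.abs(theta).sum()
    group_term = lam_group * np.linalg.norm(theta, axis=1).sum()
    return l1_term + group_term

theta = np.maximum(np.random.randn(100, 50), 0.0)  # toy document-topic matrix
print(sparse_group_penalty(theta))

Adding such a penalty to a non-probabilistic topic model's objective directly controls how many non-zero entries survive in the learned codes, which is the sense in which these methods "directly control" sparsity.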
In detail, the work of this thesis consists of the following four parts.

1) To address incomplete sparse representation learning, this thesis proposes a non-probabilistic sparse topic model, Sparse Topical Coding with Sparse Groups (STCSG). The model uses sparse group lasso regularization constraints to directly control the sparsity of representations at the document, topic, and word levels, and an efficient algorithm based on the Alternating Direction Method of Multipliers (ADMM) is proposed to solve the model's non-convex learning problem and obtain efficient, iterative update expressions.

2) To address inaccurate sparse structure modeling, this thesis proposes a new topic model, Bayesian Sparse Topical Coding (BSTC). The model uses hierarchical Laplacian and Normal-Jeffrey sparse priors to effectively constrain sparsity at the document, topic, and word levels, and it can automatically exploit the semantic correlation between sparse representations to improve the learning of text representations. Efficient expectation-maximization and variational inference algorithms are also devised to approximate the model posterior.

3) To address poor model scalability, this thesis combines deep learning techniques with topic modeling and proposes a novel neural topic model, Neural Sparse Topical Coding (NSTC), together with a series of sparse extensions of it. The model uses neural networks to model the generative process of traditional sparse topic models, simplifying their complex inference and improving scalability. In addition, the model introduces semantic relevance information based on word embeddings to constrain topic generation, further improving the semantic coherence of representation learning.

4) To address limited semantic representation ability, this thesis proposes a Semantic Reinforcement Neural Sparse Topic Model (SR-NSTM) and a supervised extension of it, based on neural variational inference and neural language models. In this method, neural variational inference improves the scalability of the topic model, a Laplacian prior distribution constrains the sparsity of the latent representations, and a bidirectional LSTM network encodes document-level representations to provide semantic enhancement and improve representation learning (see the sketch below).

Experimental results on several benchmark datasets show that the proposed methods are more effective than previous state-of-the-art methods in short text representation learning and semantic information extraction.
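To make the fourth contribution more concrete, below is a minimal sketch of its two key ingredients: a bidirectional LSTM that encodes a document and a Laplace-style (L1) sparsity penalty on the inferred document-topic code. This is not the SR-NSTM implementation; the class name, dimensions, pooling choice, and penalty form are illustrative assumptions.

import torch
import torch.nn as nn

class SparseDocEncoder(nn.Module):
    # Illustrative encoder: BiLSTM over word embeddings -> pooled document
    # vector -> non-negative document-topic code plus an L1 sparsity penalty.
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64, n_topics=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.to_topics = nn.Linear(2 * hidden_dim, n_topics)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden_dim)
        doc_vec = h.mean(dim=1)                      # simple mean pooling
        theta = torch.relu(self.to_topics(doc_vec))  # non-negative topic code
        l1_penalty = theta.abs().sum()               # Laplace prior acts like an L1 term
        return theta, l1_penalty

encoder = SparseDocEncoder(vocab_size=5000)
theta, penalty = encoder(torch.randint(0, 5000, (8, 20)))  # toy batch of 8 short texts

In the thesis, sparsity is imposed through a Laplacian prior within neural variational inference rather than a bare penalty term, but the shrinking effect on most topic weights follows the same L1-style intuition.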
Keywords/Search Tags: short texts, topic model, sparse regularization, sparse Bayesian learning, neural network, word embeddings