
Block Bayesian Sparse Topical Coding

Posted on: 2019-02-14    Degree: Master    Type: Thesis
Country: China    Candidate: H L Shi    Full Text: PDF
GTID: 2428330545499747    Subject: Software engineering
Abstract/Summary:
At present, the explosive growth of information, and especially the rapid rise of short texts such as Weibo and Twitter posts, has greatly facilitated the dissemination of information. Learning sparse low-dimensional representations of short texts from large-scale corpora is fundamental to latent topic extraction. However, short document length, a very large vocabulary, a broad range of topics, and tangled noise pose great challenges for short-text topic mining.

Currently, two main families of methods are used to mine the latent semantics of short texts: traditional probabilistic topic models (PTMs) and non-probabilistic topic models (NPMs). PTMs lack a posterior sparsity mechanism that directly controls the learned representations, which makes it difficult for them to achieve satisfactory results when modeling short-text topics. NPMs can directly control the sparsity of short-text representations, but they have their own limitations: inference is often computationally slow, and the learned representations are not sparse enough. Researchers have proposed many improvements, yet these still fall short of fully exploiting the valuable information in short texts. The rapid development of new techniques such as block sparse Bayesian learning and word embeddings brings new opportunities for short-text topic mining.

To address the sparsity issue in short texts, this thesis presents Block Bayesian Sparse Topical Coding (Block-BSTC), a model that combines word embeddings and block sparse Bayesian learning to mine sparse representations of short texts. The work is divided into three parts.

Firstly, word embeddings are used to obtain vector representations of words, and the k-means algorithm clusters the words in embedding space. This alleviates data sparsity, improves the sparse representation of the vector matrix, and induces a block-structure partition of the matrix.

Secondly, a block sparse Bayesian learning algorithm learns sparse low-dimensional representations of the short texts. Like NPMs, it directly controls the sparsity of the document codes and word codes, and it additionally exploits the block-structure correlations among code entries. Topic semantics are then obtained from the learned sparse low-dimensional representations.

Finally, the word codes and the topic dictionary are jointly optimized. Following the sparse Bayesian learning framework, the word-vector learning process, the word codes, and the topic dictionary are optimized with an expectation-maximization (EM) algorithm combined with a topic-dictionary learning method.

Experiments on the 20 Newsgroups dataset show that the Block-BSTC model yields sparser word codes and improves document-classification accuracy.
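To make the first step concrete, here is a minimal Python sketch of grouping vocabulary words into blocks by clustering their embeddings with k-means. The `vocab` list and the random `embeddings` matrix are stand-ins for illustration only; in practice pre-trained word2vec or GloVe vectors would be used, and the cluster count would match the desired number of blocks.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical inputs: `vocab` is the corpus vocabulary and `embeddings`
# is a (V, d) matrix of word vectors, one row per word. Real pre-trained
# vectors would replace the random stand-in below.
vocab = ["weibo", "twitter", "topic", "model", "sparse", "code"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 50))

# Cluster words in embedding space; each cluster is treated as one
# "block" of semantically related words for the block-sparse prior.
n_blocks = 2
km = KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit(embeddings)

# block_of[w] gives the block index of word w; reordering the vocabulary
# by block index groups related words into contiguous blocks.
block_of = {w: int(c) for w, c in zip(vocab, km.labels_)}
block_order = np.argsort(km.labels_)
print(block_of)
```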
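The second step can be illustrated with a simplified block sparse Bayesian learning routine in EM form. This is a generic sketch, not the thesis's exact inference procedure: it assumes Gaussian noise with a fixed variance `lam` and an identity intra-block correlation matrix, and it assigns one variance hyperparameter per block so that whole blocks of the code are pruned together.

```python
import numpy as np

def block_sbl(A, y, blocks, lam=1e-2, n_iter=50):
    """Simplified block sparse Bayesian learning (EM form).

    A      : (m, n) design matrix (e.g. a topic dictionary).
    y      : (m,) observation (e.g. a document's term vector).
    blocks : list of index arrays, one per block of coefficients.
    lam    : assumed noise variance (held fixed here for simplicity).
    Returns the posterior mean of the block-sparse code x.
    """
    m, n = A.shape
    gamma = np.ones(len(blocks))              # one variance per block
    for _ in range(n_iter):
        # Prior covariance is block-diagonal: gamma_b * I on block b.
        g = np.empty(n)
        for b, idx in enumerate(blocks):
            g[idx] = gamma[b]
        # Posterior covariance and mean of x given y.
        Sigma = np.linalg.inv(A.T @ A / lam
                              + np.diag(1.0 / np.maximum(g, 1e-12)))
        mu = Sigma @ A.T @ y / lam
        # EM update: each block's variance from its posterior moments.
        for b, idx in enumerate(blocks):
            gamma[b] = (mu[idx] @ mu[idx]
                        + np.trace(Sigma[np.ix_(idx, idx)])) / len(idx)
    return mu
```

Blocks whose variance hyperparameter collapses toward zero drive all of their code entries to zero together, which is what produces the block-sparse pattern in the learned codes.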
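Building on `block_sbl` above, the joint optimization of codes and topic dictionary can be sketched as an alternating scheme. The function name, the regularized least-squares dictionary update, and the use of blocks over code dimensions are illustrative assumptions; the thesis derives its blocks from the k-means word clusters and uses its own topic-dictionary learning method.

```python
import numpy as np

def fit_block_bstc(Y, blocks, k, n_outer=5, n_iter=20, lam=1e-2):
    """Alternating-optimization sketch (hypothetical, not the thesis code).

    Y      : (V, n_docs) term-frequency matrix, one column per document.
    blocks : index groups over the k code dimensions.
    E-step : infer each document's block-sparse code with block_sbl.
    M-step : refit the topic dictionary by regularized least squares.
    """
    V, n_docs = Y.shape
    rng = np.random.default_rng(0)
    D = rng.random((V, k))                      # topic dictionary
    S = np.zeros((k, n_docs))                   # document codes
    for _ in range(n_outer):
        for d in range(n_docs):                 # E-step: codes per document
            S[:, d] = block_sbl(D, Y[:, d], blocks, lam=lam, n_iter=n_iter)
        # M-step: D = Y S^T (S S^T + eps I)^(-1), clipped to be nonnegative
        D = Y @ S.T @ np.linalg.inv(S @ S.T + 1e-6 * np.eye(k))
        D = np.maximum(D, 0)
    return D, S
```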
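Finally, a sketch of the kind of evaluation the abstract describes: learn codes on 20 Newsgroups documents (only two categories here, to keep it fast) and use them as features for a classifier. The vectorizer settings, topic count, and block layout are arbitrary choices for illustration, not the thesis's experimental configuration; it reuses `block_sbl` and `fit_block_bstc` from the sketches above.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small slice of 20 Newsgroups for a quick illustration.
cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer(max_features=2000, stop_words="english")
Y_train = vec.fit_transform(train.data).toarray().T    # (V, n_docs)
Y_test = vec.transform(test.data).toarray().T

# Learn dictionary and codes on training documents, then encode the
# test documents with the fixed dictionary.
k = 20
blocks = np.array_split(np.arange(k), 4)                # toy code blocks
D, S_train = fit_block_bstc(Y_train, blocks, k)
S_test = np.column_stack([block_sbl(D, Y_test[:, d], blocks)
                          for d in range(Y_test.shape[1])])

# Document codes serve as low-dimensional classification features.
clf = LogisticRegression(max_iter=1000).fit(S_train.T, train.target)
print("accuracy:", accuracy_score(test.target, clf.predict(S_test.T)))
```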
Keywords/Search Tags: Sparse Topical Coding, Block Bayesian Sparse Learning, word embeddings, k-means, word code