
Short Text Topic Model With Word Discrimination Learning

Posted on: 2019-12-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y N Niu
Full Text: PDF
GTID: 2428330545452252
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the Web 2.0 era and the widespread adoption of social media, short texts appear in every corner of the Internet: information retrieval queries, advertising keywords, page titles, anchor texts, online questions, microblogs, and reviews are all short texts. Short texts are updated quickly, easy to produce, rich in content, and large in scale, but each individual text carries little information. Because the number of words is small, there is not enough statistical evidence for inference, so understanding the semantics of short texts is a major challenge. In addition, because short texts often do not follow standard grammar, traditional natural language processing techniques such as part-of-speech tagging and syntactic parsing are difficult to apply to them directly. Nevertheless, short text understanding is basic research underpinning the development of artificial intelligence, and it is of crucial importance to many practical application scenarios.

Text clustering is a basic method of text analysis, and the topic model is an effective method for short text clustering, but it faces high dimensionality and sparsity in short text applications. In particular, the lack of word co-occurrence information makes it difficult for a topic model to mine the underlying cluster structure. Our study found that a small number of words in a short text's word vector are particularly important for learning the cluster structure, while the influence of noise words is correspondingly more pronounced. We therefore propose a framework for short text topic models with word discrimination learning: binomial distributions are introduced into the LDA, BTM, and GSDMM models to learn each word's discriminative power over the cluster structure. Experimental results on multiple benchmark data sets show that the new word-discrimination models LDA-?, BTM-?, and GSDMM-? not only improve the learning of the cluster structure but also accelerate the convergence of the original models.

To further improve the effectiveness of topic models in short text clustering, we use a small number of samples with supervision information to guide the clustering process. Using multi-conditional learning theory, the LDA, BTM, and GSDMM models are extended to the semi-supervised clustering models Semi-LDA, Semi-BTM, and Semi-GSDMM, which learn a latent structure over both labeled and unlabeled samples. Experiments on several benchmark data sets, including comparisons with the semi-supervised topic models Semi-LDA-?, Semi-BTM-?, and Semi-GSDMM-?, show that adding supervision information improves the effectiveness of topic models in short text clustering.
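The abstract does not give the formula by which word discriminative power is learned inside the samplers, so the following is only a minimal post-hoc proxy for the same notion: given word counts per cluster, a word concentrated in one cluster scores high, while a word spread evenly across clusters (a likely noise word) scores near zero. The function name, the smoothing constant, and the KL-divergence score are illustrative assumptions, not the thesis's model.

```python
from math import log

def discrimination(counts, smooth=0.1):
    """Score each word's discriminative power over a cluster structure.

    `counts[k][w]` is word w's count in cluster k.  The score is the KL
    divergence between p(cluster | word) and the clusters' overall size
    distribution: a word whose occurrences track cluster membership
    diverges strongly from the base rates; a noise word does not.
    """
    K = len(counts)
    vocab = set().union(*[set(c) for c in counts])
    totals = [sum(c.values()) for c in counts]
    grand = sum(totals)
    scores = {}
    for w in vocab:
        cw = [counts[k].get(w, 0) + smooth for k in range(K)]
        s = sum(cw)
        kl = 0.0
        for k in range(K):
            p = cw[k] / s                               # p(cluster | word)
            q = (totals[k] + smooth) / (grand + K * smooth)  # cluster base rate
            kl += p * log(p / q)
        scores[w] = kl
    return scores
```

On a toy two-cluster count table, cluster-specific words such as "apple" or "cpu" score far above a word like "the" that appears equally in both clusters, which is the behavior the word-discrimination models exploit.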
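The abstract likewise does not spell out the Semi-GSDMM inference procedure. As one plausible reading, the sketch below implements a collapsed Gibbs sampler for a Dirichlet multinomial mixture in the GSDMM style, where documents with a known label are clamped to their cluster and only unlabeled documents are resampled. The function name, hyperparameter values, and the clamping scheme are illustrative assumptions.

```python
import random
from collections import Counter

def semi_gsdmm(docs, K, labels=None, alpha=0.1, beta=0.1, iters=100, seed=0):
    """GSDMM-style collapsed Gibbs sampler over short texts (token lists).

    Documents with an integer entry in `labels` are clamped to that cluster
    (the semi-supervised case); documents whose label is None are resampled
    on every sweep.  Returns the final cluster assignment per document.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    D = len(docs)
    labels = labels or [None] * D
    z = [labels[d] if labels[d] is not None else rng.randrange(K)
         for d in range(D)]
    m = [0] * K                        # documents per cluster
    n = [0] * K                        # word tokens per cluster
    nw = [Counter() for _ in range(K)] # word counts per cluster
    for d, k in enumerate(z):
        m[k] += 1; n[k] += len(docs[d]); nw[k].update(docs[d])

    def score(d, k):
        # p(z_d = k | rest): mixture weight times the DMM word likelihood
        p = m[k] + alpha
        i = 0
        for w, c in Counter(docs[d]).items():
            for j in range(c):
                p *= (nw[k][w] + beta + j) / (n[k] + V * beta + i)
                i += 1
        return p

    for _ in range(iters):
        for d in range(D):
            if labels[d] is not None:
                continue               # clamped: keep the supervised cluster
            k = z[d]
            m[k] -= 1; n[k] -= len(docs[d]); nw[k].subtract(docs[d])
            k = rng.choices(range(K), weights=[score(d, j) for j in range(K)])[0]
            z[d] = k
            m[k] += 1; n[k] += len(docs[d]); nw[k].update(docs[d])
    return z
```

With a few documents clamped per cluster, the remaining documents are pulled toward the cluster whose labeled seeds share their vocabulary, which mirrors the abstract's claim that a small amount of supervision information guides the clustering of unlabeled short texts.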
Keywords/Search Tags: Short Text, Clustering, Discrimination, Topic Model, Semi-supervised