Research On Topic Modeling For Short Text With Enriched Feature Representation

Posted on: 2018-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: W H Lei
GTID: 2348330536988241
Subject: Engineering
Abstract/Summary:
With the rapid development of online social media and e-commerce, online applications such as microblogs, Moments, and product reviews have generated a large volume of short text. However, mining valuable knowledge efficiently from these short texts remains a challenge. Topic models are an effective way to mine knowledge from massive text data: they discover the topic structure of a text collection from word co-occurrence information at the document level. Because short texts have sparse representations, however, applying topic models to them directly suffers from a lack of co-occurrence patterns. In addition, because social short texts are real-time and dynamic and their number is growing rapidly, it is necessary to explore efficient parallel computing models for topic modeling of large-scale social texts and to capture the real-time characteristics inherent in them. To address these problems, this thesis makes the following main contributions:

1. To better mine topics from short texts, we study the problem of feature extraction for short texts. We propose a new feature construction method that introduces frequent text pattern mining, and we present a new frequent pattern mining algorithm (PSTR) to enhance the text feature representation. It captures semantic relations and co-occurrence patterns at the corpus level.

2. Building on this work, we study a topic modeling method (PSTR-LDA) under the new feature representation. In the new feature space, we assume topic identity among words, i.e., the constituent words that form a pattern receive identical topic assignments. This assumption reflects the topic dependence among the constituent words. The model uses Gibbs sampling to infer its parameters. Experiments on corpora of different genres show that PSTR-LDA discovers more prominent and coherent topics than other probabilistic topic models and achieves significant performance improvements on several evaluation metrics.

3. To address topic modeling for large-scale social short texts, we study parallel LDA modeling and the dynamic topic model (DTM), which can capture how topics change over time. We then propose a dynamic topic modeling method for large-scale text collections based on data decomposition and post-clustering. It divides the whole corpus into independent fragments according to different features (e.g., time), models the fragments in parallel, and then clusters the local topics at a later stage. Experiments show that, compared with DTM, it takes less execution time and captures the dynamic characteristics of topics more effectively.
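
The feature-enrichment idea in the first contribution can be illustrated with a minimal sketch. This is not the PSTR algorithm, whose details the abstract does not give. It assumes the simplest possible pattern, an unordered word pair, and a hypothetical document-frequency threshold `min_support`. The function names `mine_frequent_pairs` and `enrich` are illustrative only.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(docs, min_support=2):
    """Return word pairs that co-occur in at least `min_support` documents.

    Assumption: a "pattern" here is just an unordered word pair; the
    thesis's PSTR algorithm may mine richer patterns.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair at most once per document.
        counts.update(combinations(sorted(set(tokens)), 2))
    return {pair for pair, n in counts.items() if n >= min_support}


def enrich(docs, frequent_pairs):
    """Append one joined pattern token for every frequent pair a document contains."""
    enriched = []
    for tokens in docs:
        present = set(tokens)
        patterns = [
            f"{a}_{b}"
            for a, b in sorted(frequent_pairs)
            if a in present and b in present
        ]
        enriched.append(list(tokens) + patterns)
    return enriched


# Toy corpus of tokenized short texts (made up for illustration).
docs = [
    ["apple", "iphone", "battery"],
    ["iphone", "battery", "life"],
    ["apple", "pie", "recipe"],
]
pairs = mine_frequent_pairs(docs, min_support=2)
enriched_docs = enrich(docs, pairs)
```

Because the pairs are counted over the whole corpus, a short document gains features that reflect corpus-level co-occurrence, which is what a sparse short-text representation otherwise lacks. A topic model could then treat each pattern token as a unit whose constituent words share one topic assignment, in the spirit of the second contribution.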
Keywords/Search Tags: Topic Modeling, Short Text, Text Representation, Frequent Pattern, Large-Scale Text, Parallel Modeling