
An Extended Topic Model for Short Text Data Mining

Posted on: 2016-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: N Dai
Full Text: PDF
GTID: 2308330461992016
Subject: Computer software and theory
Abstract/Summary:
We are living in an era of information revolution in which social media is gradually replacing traditional media. Online social media platforms such as Facebook and Twitter, as well as domestic Chinese companies such as Sina and Tencent, have created an information transmission model in which any ordinary user can act as a publisher or disseminator. This makes the generation and spread of information quicker and more convenient than before. With the popularity of mobile intelligent terminals, more and more users are willing to express their opinions anytime and anywhere, share their experiences, and even voice their political views. Social media has therefore become a valuable data source for studying public emergencies, public sentiment, and personal perceptions.

Ordinary Internet users produce massive amounts (terabytes) of short text data through social media every day, a volume of information that traditional media cannot match. How to discover useful information from this massive short text in social media is thus a major current challenge and has become an international research focus. Topic models have proven to be a very effective means of text mining. With the development of instant messaging, mining the features of these massive short texts is becoming increasingly important. However, because of the sparsity of short text, traditional topic models such as LDA (Latent Dirichlet Allocation) cannot mine the features of short text well.

Building on a study of the LDA and BTM (Biterm Topic Model) topic models, we propose a new topic model for massive short text. In this model, we represent a document by its "biterm" co-occurrences instead of the traditional word occurrences. In this way we can alleviate the sparsity of short text and improve the mining results of the topic model, while retaining some outstanding characteristics of the traditional topic model.

The main work of this thesis is reflected in two aspects:

(1) Following the development of topic models, we study the LDA and BTM topic models in detail, including their generative processes, their performance, and the reasons behind it. Combining this with the characteristics of short text, we propose an extended topic model, referred to as bLDA, which introduces the "biterm" of BTM into LDA. In the generative process of bLDA, a document is modeled by "biterm" co-occurrence instead of the traditional word co-occurrence, which alleviates the data sparsity problem that degrades model performance. At the same time, unlike BTM, which implicitly captures corpus-level "biterm" co-occurrence and thereby increases topic dimension and time complexity, bLDA captures only document-level "biterm" co-occurrence, like LDA, and thus retains the dimension and time-complexity advantages of the traditional model (see the biterm-extraction sketch after this abstract).

(2) The experiments in this thesis use three real-world short text datasets: a Baidu Q&A dataset, a news-title dataset from Sogou Laboratory, and the news-title dataset from Phan's paper. Experimental results show that the bLDA topic model mines topic features from massive short text better than the LDA and BTM topic models.
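The abstract does not give the full generative process of bLDA, but its core idea is that each short document is represented by the set of unordered word pairs ("biterms") that co-occur within it, rather than by its individual word occurrences. The following is a minimal, hypothetical Python sketch of document-level biterm extraction; the function and variable names are illustrative and are not taken from the thesis.

```python
from itertools import combinations

def extract_biterms(doc_tokens):
    """Return the unordered word pairs ("biterms") co-occurring in one short document.

    For very short texts (tweets, Q&A snippets, news titles) the whole document
    is treated as a single co-occurrence window.
    """
    # Sort each pair so that ("short", "text") and ("text", "short") are the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(doc_tokens, 2)]

# Example on a tokenized news title
title = ["topic", "model", "short", "text"]
print(extract_biterms(title))
# [('model', 'topic'), ('short', 'topic'), ('text', 'topic'),
#  ('model', 'short'), ('model', 'text'), ('short', 'text')]
```

Under the document-level scheme described in the abstract, these per-document biterms would play roughly the role that individual word tokens play in LDA, whereas BTM pools the biterms of all documents into a single corpus-level collection.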
Keywords/Search Tags: topic model, LDA, BTM, data sparsity