
Contextual Topic Modeling

Posted on: 2018-12-01    Degree: Master    Type: Thesis
Country: China    Candidate: D Y Chang    Full Text: PDF
GTID: 2348330542965215    Subject: Software engineering
Abstract/Summary:
Topic models are among the most effective tools for analyzing large-scale document collections: they can extract useful semantic information from large amounts of unstructured text. Since Latent Dirichlet Allocation (LDA) was proposed, it has attracted wide attention, and many inference methods have been developed for it. Through continuous improvement, these models have been applied in many areas with good practical results. LDA is an unsupervised model that automatically extracts semantic information from documents and discovers the semantic associations behind it. It is also based on the "bag-of-words" hypothesis, which treats a document as a collection of word frequencies and ignores word order, leading to semantic confusion: the same word may have different meanings in different parts of a document depending on its context. Although this assumption simplifies LDA, it hurts predictive performance, which leaves room for improvement.

In this paper, two novel topic models are proposed that eliminate the "bag-of-words" hypothesis and improve predictive performance by taking word order into account.

(1) Sliding-window based Topic Model (SWTM): this model cuts the document into smaller fragments according to a window size and a stride, and computes a topic distribution for the words in each window. The basic idea is that the topic of a word is closely related to, and influenced by, the topics of the words near it. Each word belongs to several different windows, each of which contributes different contextual information. Empirical results show that SWTM reduces average perplexity by 25%–54% and also improves the convergence rate.

(2) Centroid-word based Contextual Topic Model (CCTM): SWTM does not completely abandon the "bag-of-words" hypothesis, so the Centroid-word based Contextual Topic Model is proposed. This model takes each word as a pivot, extends a number of surrounding words as its context, and computes a topic distribution per centroid. The same word can therefore receive different topic distributions in different parts of the document, reflecting its different contexts. Empirical results show that CCTM reduces average perplexity by a further 9% compared with SWTM.

(3) The two models above are designed for offline data and load the entire dataset into memory, so training on big data can run out of memory or take a long time. Online variants for data streams, the online sliding-window based topic model (OSWTM) and the online centroid-word based contextual topic model (OCCTM), are therefore proposed. Empirical results show that OSWTM and OCCTM reduce average perplexity by 24%–55% and 37%–63%, respectively, compared with other online topic models.
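To illustrate the SWTM segmentation step, the following is a minimal sketch of cutting a tokenized document into overlapping fragments by window size and stride. The function name and parameters are illustrative, not taken from the thesis; the actual model then infers a topic distribution per window, which is not shown here.

```python
def sliding_windows(tokens, window_size, stride):
    """Cut a token sequence into (possibly overlapping) windows.

    Because windows overlap when stride < window_size, each token can
    appear in several windows and thus be seen in several local
    contexts -- the intuition behind SWTM.
    """
    windows = []
    last_start = max(len(tokens) - window_size, 0)
    for start in range(0, last_start + 1, stride):
        windows.append(tokens[start:start + window_size])
    return windows


doc = ["topic", "model", "latent", "dirichlet", "allocation", "inference"]
print(sliding_windows(doc, window_size=4, stride=2))
# → [['topic', 'model', 'latent', 'dirichlet'],
#    ['latent', 'dirichlet', 'allocation', 'inference']]
```

With stride smaller than the window size, adjacent windows share words, so a word's topic assignment can be informed by more than one local context.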
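The CCTM pivot-and-context idea can be sketched in the same way. Here each word acts as a centroid and up to `radius` neighbours on each side form its context; `radius` and the function name are assumptions for illustration, since the abstract does not specify how much context the model uses.

```python
def centroid_contexts(tokens, radius):
    """Pair each token (the centroid) with its surrounding context.

    The same word type appearing at two positions gets two different
    context lists, so the model can assign it two different topic
    distributions -- the core idea behind CCTM.
    """
    contexts = []
    for i, word in enumerate(tokens):
        left = tokens[max(i - radius, 0):i]       # up to `radius` words before
        right = tokens[i + 1:i + 1 + radius]      # up to `radius` words after
        contexts.append((word, left + right))
    return contexts


print(centroid_contexts(["topic", "model", "inference"], radius=1))
# → [('topic', ['model']),
#    ('model', ['topic', 'inference']),
#    ('inference', ['model'])]
```

Unlike the fixed windows of SWTM, every position gets its own context here, so no "bag-of-words" sharing remains within a fragment.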
Keywords/Search Tags: topic model, Latent Dirichlet Allocation, sliding window, centroid word, context information