
Research On Topic Modeling With Improved Term Weighting

Posted on: 2019-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: A Zhang
Full Text: PDF
GTID: 2428330548959297
Subject: Engineering
Abstract/Summary:
With the development of the internet, and especially the rise of comprehensive, polymorphic mobile internet services in recent years, text data has grown explosively as users easily create and browse content. Mining useful information from massive text data has become an active research topic in machine learning. Topic models can effectively mine latent topic information from corpora and are widely used across many fields of text analysis. However, because of the bag-of-words assumption, traditional topic models consider only the frequency of words in a document and cannot accurately reflect each word's contribution to topic modeling. The high-frequency words of trained topics often include words with low discriminative power, and some of the resulting topics have no clear meaning. Term weighting can distinguish the contributions of words and is therefore widely used in text analysis; in the topic modeling field, log-LDA, PMI-LDA, and BDC-LDA are applications of term weighting. These models remain unsatisfactory, however, because they ignore informative words. To address the problem of meaningless topics, this paper proposes Entropy-based term Weighting (EW), which exploits word context and the theory of information entropy. To obtain more robust term weights, we further propose Combination of EW (CEW), built on two existing term weighting schemes, and apply CEW to traditional topic models. The main tasks are as follows:

1. Analyze the research status of topic models. Topic models describe the process of text generation and can mine the latent topics of a text, but they may generate meaningless topics. After topic modeling on a corpus, we examined the high-frequency words of the topics and found two causes of the problem. First, low-discrimination words that do not help define a topic's semantics are spread across the representative words of many topics. Second, informative words that do indicate a topic's meaning appear only in later positions. Existing term-weighted topic models mitigate the problem by reducing the influence of low-discrimination words during modeling, but they ignore the importance of informative words.

2. Propose improved term weighting (EW and CEW) to solve the problem that informative words are ignored. To raise the weight of informative words, we propose Entropy-based term Weighting (EW), in which conditional entropy over word co-occurrences measures the amount of information a word carries. We further propose Combination of EW (CEW), which combines the log weight, reflecting word distribution across the corpus, with the BDC weight, reflecting word distribution at the topic level. A hedged sketch of both weights is given below, after this list.

3. Apply CEW to topic models. We apply CEW to DMM and LDA, and give the training procedure, the Gibbs sampling formula, and the parameter estimation formulas of the combined models; a sketch of how term weights can enter Gibbs sampling also appears below.
4. Comparative experiments. We conducted experiments on eight datasets, in two parts.
a. Comparison between the CEW topic models and several representative topic models. The traditional topic model, the PMI topic model, and the BDC topic model were selected as baselines, and the training results of each model were evaluated by topic coherence and clustering quality. The experimental results show that the CEW topic models produce more meaningful topics than the other models in most cases, and, at the document level, their clustering results agree more closely with the true categories of the datasets.
b. Comparison of pairwise weight combinations. Topic models built from each pairwise combination of weights were evaluated by clustering quality (an evaluation sketch is given below). The experimental results confirm the rationality of the chosen combination: EW plays a key role in CEW and accurately reflects the importance of words.
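As a concrete reading of task 2, the following Python sketch computes an entropy-based weight from word co-occurrence counts and combines it with precomputed log and BDC weights. The abstract does not give the thesis's exact formulas, so the window size, the normalization, and the product-style combination rule are all assumptions made for illustration.

    # Hypothetical sketch of Entropy-based term Weighting (EW) and the
    # combined weight CEW. Function names, the sliding window, and the
    # combination rule are illustrative assumptions, not the thesis's formulas.
    from collections import Counter, defaultdict
    from math import log

    def ew_weights(docs, window=5):
        """Weight each word by the conditional entropy of its co-occurring
        context words: under this reading of EW, words with richer, more
        evenly spread contexts carry more information."""
        cooc = defaultdict(Counter)          # cooc[w][c] = co-occurrence count
        for doc in docs:                     # docs: list of token lists
            for i, w in enumerate(doc):
                for c in doc[max(0, i - window): i + window + 1]:
                    if c != w:
                        cooc[w][c] += 1
        weights = {}
        for w, ctx in cooc.items():
            total = sum(ctx.values())
            # H(C | w) = -sum_c p(c|w) * log p(c|w)
            weights[w] = -sum((n / total) * log(n / total)
                              for n in ctx.values())
        if weights:                          # normalize into (0, 1]
            m = max(weights.values())
            weights = {w: h / m for w, h in weights.items()}
        return weights

    def cew_weights(ew, log_w, bdc):
        """Combine EW with the corpus-level log weight and the topic-level
        BDC weight; a simple product is assumed here, since the abstract
        does not state the combination rule."""
        return {w: ew.get(w, 0.0) * log_w.get(w, 1.0) * bdc.get(w, 1.0)
                for w in ew}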
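Task 3 applies CEW inside Gibbs sampling. A common formulation of term-weighted LDA, sketched below, lets each token update the count matrices by its weight instead of by 1, so that informative words contribute more to the topic-word statistics. The thesis's exact sampling and parameter formulas may differ; this function is illustrative only.

    # Minimal sketch of one collapsed Gibbs sweep for term-weighted LDA.
    # n_dk, n_kw, n_k are weighted (float) count matrices; weights maps a
    # word id to its CEW weight. Hyperparameter values are placeholders.
    import numpy as np

    def weighted_gibbs_pass(docs, z, weights, n_dk, n_kw, n_k,
                            alpha=0.1, beta=0.01):
        """One sweep over all tokens; docs[d] is a list of word ids and
        z[d] holds the current topic assignment of each token."""
        K, V = n_kw.shape
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, s = z[d][i], weights[w]
                # remove this token's weighted contribution
                n_dk[d, k] -= s; n_kw[k, w] -= s; n_k[k] -= s
                # p(z = k) propto (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = np.random.choice(K, p=p / p.sum())
                z[d][i] = k
                # add the weighted contribution back under the new topic
                n_dk[d, k] += s; n_kw[k, w] += s; n_k[k] += s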
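For the clustering evaluation in task 4, one standard procedure, assumed here since the abstract names no specific metric, is to assign each document to its dominant topic and score the resulting partition against gold labels with normalized mutual information:

    # Hypothetical evaluation sketch: cluster documents by dominant topic
    # and compare against gold labels with NMI, one common way to measure
    # the "clustering effect" the abstract refers to.
    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    def clustering_nmi(theta, gold_labels):
        """theta is the D x K document-topic matrix of a trained model;
        each document is assigned to its highest-probability topic."""
        predicted = np.argmax(theta, axis=1)
        return normalized_mutual_info_score(gold_labels, predicted)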
Keywords/Search Tags: Topic Model, Term Weighting, Entropy, LDA, DMM, Gibbs Sampling