
Shannon Entropy And Mutual Information For Topic Optimization Research

Posted on: 2018-08-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2348330512477219
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the big data era, the problem of information overload has become increasingly serious, and researchers have been working on how to extract useful information from massive amounts of data. Topic models address this problem by extracting latent low-dimensional topics from large collections of discrete text. However, as the corpus to be processed grows, the number of topics generated by a topic model also grows, and among them appear noisy topics that capture little semantic information. Researchers have therefore paid increasing attention to removing such noisy topics.

This thesis focuses on improving topic quality, studying optimization methods and a series of topic-tuning strategies. We apply Shannon entropy and mutual information theory to the text corpus in order to measure topic quality and to eliminate semantically uninformative background words within topics. Our methods also operate on the topics themselves to improve the quality of the whole topic space. The work proceeds along two lines:

Topic word optimization. The words in a topic directly affect its interpretability, and removing background words preserves topic quality and benefits downstream training. Based on an existing labeled text set, Shannon entropy and mutual information are applied to the corpus, and background words are eliminated according to the statistical features of the words. The improved topics are then applied to text classification tasks.

Topic optimization. Noisy topics without significant semantics are eliminated to improve the quality of the topic set. Building on topic word optimization, Shannon entropy and mutual information are applied to the topics themselves to study their features and the relationship between topics and categories, by which topic quality is measured and noisy topics are identified. The improved topics are again applied to text classification tasks.

In summary, Shannon entropy and mutual information are used to eliminate noisy words within topics and noisy topics respectively, preserving the semantic features of the topics to the greatest extent and completing the optimization of both the topic-word space and the topic space. Text categorization experiments verify the quality of the sparsified topic model.
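The abstract does not give the formulas or code used in the thesis; the following is only a minimal sketch of how the two word-level measures it names might be computed over a labeled corpus. The document representation (a list of `(tokens, label)` pairs), the function names, and the simplification that mutual information is summed only over documents where the word occurs are all assumptions for illustration, not the thesis's actual method. Background words that occur evenly across all classes should score high entropy and low mutual information, making them candidates for elimination.

```python
import math
from collections import Counter, defaultdict

def word_class_entropy(docs):
    """Shannon entropy of each word's distribution over class labels.
    docs: list of (tokens, label) pairs. Words spread evenly across
    classes (background words) get high entropy."""
    counts = defaultdict(Counter)  # word -> Counter over labels
    for tokens, label in docs:
        for w in set(tokens):      # count document occurrences
            counts[w][label] += 1
    entropy = {}
    for w, c in counts.items():
        total = sum(c.values())
        entropy[w] = -sum((n / total) * math.log2(n / total)
                          for n in c.values())
    return entropy

def word_class_mi(docs):
    """Mutual information between word occurrence and class label.
    Only the word-present terms are summed, a common simplification;
    low-MI words carry little class information."""
    n_docs = len(docs)
    label_count = Counter(label for _, label in docs)
    word_count = Counter()
    joint = defaultdict(Counter)   # word -> Counter over labels
    for tokens, label in docs:
        for w in set(tokens):
            word_count[w] += 1
            joint[w][label] += 1
    mi = {}
    for w in word_count:
        p_w = word_count[w] / n_docs
        s = 0.0
        for label, n_wl in joint[w].items():
            p_wl = n_wl / n_docs
            p_l = label_count[label] / n_docs
            s += p_wl * math.log2(p_wl / (p_w * p_l))
        mi[w] = s
    return mi
```

On a toy corpus where "the" appears in every document but "cat" only in class A, "the" receives maximal entropy and zero mutual information, while "cat" receives zero entropy and positive mutual information; thresholding on these scores is one way the background-word elimination described above could be realized.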
Keywords/Search Tags:Topic Model, Topic tuning, Shannon entropy, Mutual information