
Topic Optimization Method Based On Pointwise Mutual Information

Posted on: 2014-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhao
Full Text: PDF
GTID: 2268330392969050
Subject: Computer Science and Technology
Abstract/Summary:
With the continuing advance of information technology, the internet has become the most widely used and most information-rich repository in the world. Information resources of all kinds are growing ever faster and have reached massive scale, and most of this data appears as text. The volume of information is enough to meet people’s needs, but how to manage and use such vast amounts of data efficiently is a problem that urgently needs solving, and this has driven research in text classification and related areas. Research on text classification has two core parts: the classification model and the text representation. Text representation methods fall into two types. One introduces linguistic features to improve classification performance. The other uses statistical methods to mine the topic information in the text. The former requires relatively complex processing of linguistic features, which reduces the efficiency of the whole system and limits its practicality. Typical representatives of the latter are the PLSA and LDA semantic models. These are probabilistic models that use statistical theory to model the latent “document-topic-word” semantics of a dataset.
In this paper, we propose a Latent Dirichlet Allocation (LDA) model based on point-wise mutual information and a Laplace Score (LS) topic selection algorithm based on the nearest distance. LDA is a generative model, not a discriminative one. Its latent topic layer is obtained in the course of generating text, via the EM algorithm. However, when the original LDA algorithm generates text, it treats every word as equally important. This tilts topics toward high-frequency words and also causes topics to overlap.
The main contributions of this paper are threefold. First, we propose an LDA model based on point-wise mutual information. The model addresses both the tilt of topics toward high-frequency words and the overlap between topics, which lets us extract topics that better characterize a text. The experiments show that the algorithm is feasible. Second, we evaluate the extracted topics from two perspectives: the readability and consistency of the words each topic contains, and the distinctiveness and independence of the topics the model produces. The experiments clearly show that the LDA topic model based on point-wise mutual information outperforms the original LDA topic model in the readability, consistency, distinctiveness, and independence of its topics. Third, topics vary in quality, so when the topic vector is used as the feature representation of a text, its dimensions should not all carry the same weight. We propose an LS algorithm based on the shortest distance to calculate topic weights, and apply these weights in a text application.
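The abstract does not say how point-wise mutual information enters the LDA model, so the following is only a minimal sketch of the standard PMI definition, PMI(x, y) = log(p(x, y) / (p(x) p(y))), and not the thesis's algorithm. It assumes probabilities are estimated from document-level co-occurrence counts. The function name `pmi_table` and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(docs):
    """Return {(w1, w2): PMI} for word pairs that co-occur in at least one document.

    Assumption: probabilities are estimated from document frequencies,
    p(w) = df(w) / N and p(w1, w2) = df(w1, w2) / N, where N = len(docs).
    """
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        vocab = sorted(set(doc))  # count each word once per document
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))  # pairs in alphabetical order

    table = {}
    for (w1, w2), df12 in pair_df.items():
        p12 = df12 / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        table[(w1, w2)] = math.log(p12 / (p1 * p2))
    return table


# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "word"],
    ["text", "word"],
    ["topic", "model"],
]
pmi = pmi_table(docs)
```

A positive value means two words appear together more often than they would if they were independent, a negative value means less often, and zero means the counts match independence. A score of this kind could in principle be used to down-weight frequent but uninformative words, though whether the thesis does so in this way is not stated in the abstract.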
Keywords/Search Tags: topic model, Latent Dirichlet Allocation, LDA topic model based on point-wise mutual information, topic quality, topic weighting