Topic Optimization Method Based On Pointwise Mutual Information

Posted on:2014-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhao

Full Text:PDF

GTID:2268330392969050

Subject:Computer Science and Technology

Abstract/Summary:

In today's world, with the continuous advancement of information technology, the Internet has become the most widely used and most information-rich repository in the world. Meanwhile, information resources of all kinds are growing at an ever faster rate and on a massive scale, and most of this data appears in the form of text. The amount of information available now meets people's needs, but how to manage and use such vast amounts of data efficiently has become an urgent problem, which has driven research in related areas such as text classification. The core of text classification research consists of two parts: the classification model and the text representation. Text representation methods can be divided into two types. One type introduces linguistic features to improve text classification performance. The other uses statistical methods to mine the topic information of the text. The former requires relatively complex processing of linguistic features, which reduces the efficiency of the whole system and limits its practicality. Typical representatives of the latter are the PLSA and LDA semantic models. These are probabilistic models, grounded in statistical theory, that model the latent "document-topic-word" semantics of a dataset.

In this paper, we propose a Latent Dirichlet Allocation (LDA) model based on point-wise mutual information and a Laplace Score (LS) topic selection algorithm based on the nearest distance. LDA is a generative model, not a discriminative one. The latent topic layer is obtained via the EM algorithm in the process of generating text. However, when generating text, the original LDA algorithm treats every word as equally important, which tilts the topics toward high-frequency words and also causes topics to overlap.
The main contributions of this paper are threefold. First, we propose an LDA model based on point-wise mutual information. The model addresses the problems of topics tilting toward high-frequency words and of topic overlap, allowing us to extract topics that better characterize a text. The experiments show that the algorithm is feasible. Second, topic quality is evaluated from two perspectives: one is the readability and consistency of the words contained in a topic, and the other is the distinctiveness and independence of the topics within the model. The experiments clearly show that the LDA topic model based on point-wise mutual information outperforms the original LDA topic model in the readability, consistency, distinctiveness, and independence of its topics. Third, topics differ in quality, so when the topic vector is used as the feature representation of a text, the dimensions of the vector should not all carry the same weight. We therefore propose an LS algorithm based on the shortest distance to calculate the weight of each topic, and apply it to text applications.
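
The abstract does not spell out how point-wise mutual information (PMI) is computed or how it enters the LDA model. As a rough illustration only, the sketch below computes the standard PMI of two words, log(p(a, b) / (p(a) p(b))), from document-level co-occurrence counts. The function name `pmi`, the toy corpus `docs`, and the choice of document-level counting are assumptions made for this example and are not taken from the thesis.

```python
import math


def pmi(word_a, word_b, documents):
    """Document-level PMI: log( p(a, b) / (p(a) * p(b)) ).

    Probabilities are estimated as the fraction of documents that
    contain the word (or both words). This is an illustrative
    estimator, not the one used in the thesis.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    count_a = sum(1 for d in doc_sets if word_a in d)
    count_b = sum(1 for d in doc_sets if word_b in d)
    count_ab = sum(1 for d in doc_sets if word_a in d and word_b in d)
    if count_ab == 0:
        # The words never co-occur, so the ratio is zero.
        return float("-inf")
    p_a = count_a / n_docs
    p_b = count_b / n_docs
    p_ab = count_ab / n_docs
    return math.log(p_ab / (p_a * p_b))


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["topic", "model", "text"],
    ["topic", "model", "word"],
    ["the", "text", "word"],
    ["the", "topic", "model"],
]

print(pmi("topic", "model", docs))  # positive: the pair always co-occurs
print(pmi("model", "word", docs))   # negative: co-occurs less than chance
```

A positive PMI means two words appear together more often than they would if they were independent, and a negative value means less often. This is why PMI can separate strongly associated word pairs from pairs that co-occur merely because both words are frequent.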

Keywords/Search Tags:

topic model, Latent Dirichlet Allocation, LDA topic model based on point-wise mutual information, topic quality, topic weighting

Related items

1	Research And Application Of Topic Evolution Model Based On LDA
2	News Topic Discovery Research Based On The LDA Model
3	Study Of Text Evolution Analysis And Prediction Based On Topic Model
4	Research On Fast Gibbs Sampling Topic Inference Algorithms For Topic Models
5	Topic Discovery And Trend Analysis In Scientific Literature Based On Topic Model
6	Topic Model Based On Dirichlet Process
7	Study Of Tracking And Detection Technology Based On Online Topic Model
8	Research On Text Topic Modeling Based On Word Embedding
9	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow
10	Contextual Topic Modeling