
Research on Language Topic Mining Based on LDA

Posted on: 2019-04-07
Degree: Master
Type: Thesis
Country: China
Candidate: L Mao
GTID: 2417330563993062
Subject: Applied Statistics
Abstract/Summary:
In recent years, with the development of text knowledge and breakthroughs in Internet technology, people have begun trying to make computers perform deeper natural language tasks, such as intelligent customer service systems and keyword-based search. At the same time, how to automatically learn and understand the latent semantics of nearly millions of human texts has become a hot research issue. Since Blei proposed Latent Dirichlet Allocation (LDA), the paper has been cited thousands of times, and the model is widely used in fields such as search engines, recommendation systems, network and graph analysis, and advertising prediction. LDA introduces the dimension of the "topic": it captures the latent semantics of human language text, reduces a document from word space to topic space, and removes the noise caused by uninformative words.
This thesis focuses on the LDA model and its application. The main work is as follows. First, the probability and statistics underlying the model are introduced, including Bayesian statistics, the multinomial distribution, the Dirichlet distribution, conjugate priors and the calculation of expectations, followed in turn by word vector representation, the bag-of-words assumption and the PLSI model. Second, the basic principles and essence of the LDA topic model are explained. The model uses a latent variable that follows a Dirichlet distribution to represent the topic distribution of a document, and constructs a three-level Bayesian sampling process to simulate how documents are generated. Variational EM (VEM) and Gibbs sampling are used to estimate the parameters, and the two methods are compared.
Finally, the LDA model is applied to an unsupervised text topic mining project. The data are more than 10 thousand selected articles collected by a web crawler. The text is first preprocessed with word segmentation, stop-word removal and similar steps. TF-IDF is then used to compute term weights, a word cloud is drawn after sparse terms are removed and the dimensionality is reduced, and LDA models are built and evaluated by perplexity and log-likelihood to select the final number of topics. A comparison of VEM and Gibbs sampling shows that the Gibbs method is effective but takes a long time to train. Lastly, the similarity between words and topics is calculated and documents are recommended according to input words, which shows that text recommendation based on topic mining is appropriate and feasible.
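The Gibbs estimation and perplexity-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract gives no code, so the function names (lda_gibbs, perplexity), the hyperparameter values alpha and beta, the iteration count and the six-word toy corpus are all assumptions chosen for illustration. The sampler is the standard collapsed Gibbs scheme for LDA, written with the Python standard library only.

```python
import math
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and tokens per topic.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(n_topics)]
    n_k = [0] * n_topics
    # Assign a random initial topic to every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional p(z = t | everything else), unnormalised.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Smoothed estimates of document-topic and topic-word distributions.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(n_topics)
    ]
    return theta, phi


def perplexity(docs, theta, phi):
    """exp(-average per-token log-likelihood); lower is better."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3, 4], [4, 5, 3, 5]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
ppl = perplexity(toy_docs, theta, phi)
```

To choose the number of topics in the way the abstract describes, one would fit the model for several values of n_topics and compare perplexity, ideally on held-out documents rather than on the training corpus used in this sketch.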
Keywords/Search Tags:Text mining, topic model, LDA, VEM, Gibbs sampling