Extending For Topic Model Used In Web Data Mining

Posted on: 2014-01-03  Degree: Master  Type: Thesis
Country: China  Candidate: X Q Qi  Full Text: PDF
GTID: 2248330398970626  Subject: Communication and Information System
Abstract/Summary:
Microblogging has become a very popular platform for exchanging information. For the short texts that microblog users post, representing each text with traditional word features cannot measure similarity well, because the probability that the same word appears in two different short texts is small. To address the sparse, high-dimensional nature of microblog data, topic models have been widely studied for microblog text clustering. Latent Dirichlet Allocation (LDA) is the classic representative of topic models. The Author Topic Model (ATM), an effective extension of LDA, is used for the same purpose.

ATM has two disadvantages. First, each word in a document is generated from the topic multinomial distribution of only one author. Second, ATM does not take into account the internal structure information of microblogs.

The main work of this thesis is as follows.
1) A variety of topic models are studied and analyzed, and text dimensionality reduction is implemented based on LDA and ATM.
2) To solve these two problems, an improvement on ATM is presented; the new model is called ULLDA. In the generation process of each document, after the author is chosen from the author list, the corresponding topic distribution is no longer determined by the author alone: according to the characteristics of microblogs, the topic distributions of related users can also affect it, which overcomes the two shortcomings above.
3) Experiments are carried out on the NLPIR dataset. LDA, ATM, and ULLDA are each used for modeling and the results are compared. The results indicate that ULLDA is useful for microblog text clustering and can improve on the performance of ATM.
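The generation process described in item 2 can be illustrated with a small sketch. The abstract does not give the model's equations, so everything specific here is an assumption made for illustration: the linear blend in `mix_topic_distributions`, the `weight` parameter, and the `related` mapping from an author to related users are hypothetical and are not taken from the thesis. With `weight=1.0` the sketch reduces to the standard ATM step, in which each word's topic is drawn from a single author's topic distribution.

```python
import random


def sample_index(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def mix_topic_distributions(author_theta, related_thetas, weight):
    """Blend the author's topic distribution with the mean of related users'.

    ASSUMPTION: a linear blend with a fixed `weight` (share kept by the
    author) is only one possible way related users could influence the
    distribution; the abstract does not specify the actual mechanism.
    """
    if not related_thetas:
        return list(author_theta)
    k = len(author_theta)
    n = len(related_thetas)
    mean_related = [sum(t[j] for t in related_thetas) / n for j in range(k)]
    return [weight * author_theta[j] + (1.0 - weight) * mean_related[j]
            for j in range(k)]


def generate_post(authors, theta, phi, related, n_words, weight=1.0, seed=0):
    """Generate word ids for one post.

    theta:   dict user -> topic distribution (list of K probabilities)
    phi:     list of K word distributions (each a list of V probabilities)
    related: dict user -> list of related users (hypothetical structure info)
    """
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        author = rng.choice(authors)  # ATM: pick an author from the author list
        neighbours = [theta[u] for u in related.get(author, [])]
        mixed = mix_topic_distributions(theta[author], neighbours, weight)
        topic = sample_index(mixed, rng)             # draw a topic for this word
        words.append(sample_index(phi[topic], rng))  # draw a word from that topic
    return words
```

Calling `generate_post(["u1"], theta, phi, {"u1": ["u2"]}, n_words=10, weight=0.5)` draws each word's topic from an equal blend of `u1`'s distribution and that of the related user `u2`; inference of `theta` and `phi` from real data (for example by Gibbs sampling) is outside the scope of this sketch.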
Keywords/Search Tags: topic model, latent Dirichlet allocation, data mining, dimensionality reduction