
Research On Hot Topics Detection Methods For Chinese Microblog Based On Hadoop

Posted on: 2017-06-30 | Degree: Master | Type: Thesis
Country: China | Candidate: W C Wang | Full Text: PDF
GTID: 2348330542987025 | Subject: Computer application technology
Abstract/Summary:
Microblogs have become an important platform for sharing information and communicating. By the end of March 2016, Chinese microblogs had reached 120 million daily active users, and many socially influential events first spread through microblogs. Extracting hot topics matters for government guidance of public opinion, for business decision-making, and for individuals' daily lives. To extract hot topics accurately from massive microblog data, this thesis studies the following aspects.

First, the thesis implements a customized crawler that uses a user-following ("attention") strategy and simulated login to crawl, parse, and save microblog data. After saving the data, it converts complex (traditional-form) characters and then preprocesses the data according to the type of microblog. The crawled data serves both as the experimental case study and as the data source of the system. To avoid selecting microblogs that have nothing to do with hot news, the thesis puts forward two methods of calculating microblog popularity for choosing hot microblogs. Popularity is positively correlated with the numbers of comments, likes, forwards, and user followers, so the first method combines these counts into a popularity score. The second method computes popularity from the TF-IDF algorithm together with the change in word frequency between the preceding and following time periods.

Second, the thesis models the microblog data with the LDA model and uses the resulting topic distribution to represent each microblog as a vector, which addresses the high dimensionality and sparseness of the traditional representation. It implements a parallel Gibbs sampling algorithm based on MapReduce to address the slow convergence of Gibbs sampling.

Finally, the thesis proposes a microblog document clustering algorithm named BHK-means, which uses the black hole algorithm to find globally optimal initial cluster centers for K-means, addressing the tendency of K-means to fall into a local optimum. It implements a parallel BHK-means algorithm based on MapReduce to overcome low efficiency when processing massive numbers of microblogs. It then proposes a method for extracting topic words from hot microblog clusters by combining LDA with microblog popularity.

The experimental results show that the ratio of forwarded microblogs increases significantly when the popularity calculation methods are applied, which indicates that both methods are effective. Modeling microblogs with LDA gives better clustering precision than the traditional method, and the MapReduce-based Gibbs sampling algorithm achieves reasonable speedups. The BHK-means algorithm has better clustering precision, its MapReduce-based version achieves reasonable speedups, and the method combining LDA and microblog popularity is able to extract accurate hot topic words.
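
The first popularity method described above combines comment, like, forward, and follower counts into a single score. The abstract does not give the formula, so the following Python sketch is only one plausible reading: the function names `popularity` and `select_hot`, the weights in `DEFAULT_WEIGHTS`, and the log damping are all assumptions made for illustration, not the thesis's actual method.

```python
import math

# Hypothetical weights: the abstract does not say how the counts are combined.
DEFAULT_WEIGHTS = {"comments": 0.3, "likes": 0.2, "forwards": 0.4, "followers": 0.1}


def popularity(comments, likes, forwards, followers, weights=DEFAULT_WEIGHTS):
    """Weighted sum of log-damped engagement counts (illustrative assumption)."""
    counts = {
        "comments": comments,
        "likes": likes,
        "forwards": forwards,
        "followers": followers,
    }
    # log1p keeps very large follower counts from dominating the score
    # and maps a count of zero to a contribution of zero.
    return sum(weights[name] * math.log1p(value) for name, value in counts.items())


def select_hot(posts, top_n):
    """Return the top_n posts ranked by popularity, highest first."""
    ranked = sorted(posts, key=lambda p: popularity(**p["counts"]), reverse=True)
    return ranked[:top_n]


# Example data (made up for illustration only).
posts = [
    {"id": "a", "counts": {"comments": 2, "likes": 5, "forwards": 1, "followers": 300}},
    {"id": "b", "counts": {"comments": 150, "likes": 900, "forwards": 400, "followers": 300}},
]
hot = select_hot(posts, top_n=1)
```

Any monotonically increasing combination would respect the positive correlation the abstract states; the weighted log sum is chosen here only because it is simple to read and tune.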
Keywords/Search Tags: microblog, Hadoop, hot topic, black hole algorithm, LDA model