Research On Chinese Micro-blog Hot Topic Detection

Posted on: 2015-05-11  Degree: Master  Type: Thesis
Country: China  Candidate: R T Lin  Full Text: PDF
GTID: 2298330452467847  Subject: Computer software and theory
Abstract/Summary:
With the development of the Internet and the emergence of Web 2.0, microblogs have become an important tool for people to communicate, express opinions, and access news, so detecting hot topics in microblogs in a timely manner has important practical significance. However, microblog posts are short, their feature words are sparse, and the resulting large-scale term vectors are highly sparse, so traditional text processing methods cannot be applied directly to microblog data. Finding hot topics in microblogs quickly and accurately has therefore become an active research problem in natural language processing.

The main work of this thesis includes the following aspects:

(1) To address the characteristics of microblog data (short texts, sparse feature words, noisy data, and a large volume of documents), this thesis models the microblog data with the LDA topic model, exploiting its strong dimensionality-reduction ability. This both reduces the complexity of text similarity calculation and avoids the sparsity problem found in traditional text modeling.

(2) The partition-based K-means algorithm is simple and converges quickly, but it is very sensitive to the initial cluster centers and their number K. The CURE algorithm, which is based on hierarchical clustering, is not very sensitive to isolated points and handles non-spherical clusters and clusters of uneven size with high accuracy. To combine the accuracy of CURE with the efficiency of K-means, this thesis adopts a two-stage clustering strategy that uses both algorithms, which resolves the sensitivity of K-means to its initial points while retaining its clustering efficiency.

(3) After an in-depth study of the MapReduce programming model and the K-means clustering algorithm, and based on the characteristics of K-means, this thesis presents a MapReduce implementation of K-means for fast clustering of massive short-text microblog data. Experimental tests show that it significantly improves clustering efficiency and reduces hardware costs.

(4) Combining the above methods, this thesis completes the design and implementation of a microblog hot topic discovery system, which integrates three modules: a data acquisition and pre-processing module, a hot topic discovery module, and a data display module.
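
The MapReduce formulation of K-means in item (3) can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation: the function names (`map_assign`, `reduce_update`, `kmeans_mapreduce`) are hypothetical, the shuffle phase is simulated with a dictionary rather than run on a cluster, and the initial centers are simply passed in, whereas in the two-stage strategy of item (2) they would presumably come from the first clustering stage. The input vectors stand in for per-document topic distributions.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)),
                  key=lambda i: squared_distance(point, centers[i]))
    return nearest, (point, 1)


def reduce_update(pairs):
    """Reduce step: average all points emitted under one center index."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in pairs:
        for d in range(dim):
            total[d] += point[d]
        count += n
    return [t / count for t in total]


def kmeans_mapreduce(points, initial_centers, iterations=10):
    """Run a fixed number of map/shuffle/reduce rounds of K-means."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        groups = defaultdict(list)  # stands in for the shuffle phase
        for p in points:
            key, value = map_assign(p, centers)
            groups[key].append(value)
        # A center that received no points keeps its previous position.
        centers = [reduce_update(groups[i]) if i in groups else centers[i]
                   for i in range(len(centers))]
    return centers
```

Calling `kmeans_mapreduce(points, initial_centers)` with a list of vectors and a list of starting centers returns the updated centers. Because each map call depends only on one point and the current centers, the assignment work can be split across machines, which is the property a distributed implementation relies on.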
Keywords/Search Tags: Microblog, Hot Topic, MapReduce, K-means clustering, Hidden topic model