
Microblog Events Detection And Tracking Based On RIHDBSCAN Using Cloud Framework

Posted on: 2015-03-20    Degree: Master    Type: Thesis
Country: China    Candidate: N Han    Full Text: PDF
GTID: 2268330422972479    Subject: Computer software and theory
Abstract/Summary:
Microblogging, a form of social networking in which users record what is happening and join discussions via mobile phone, SMS, and other channels, has developed explosively and had a profound impact in recent years. The vast amount of real-time microblog content holds a wealth of information that is of practical value for analysis and mining, for example in risk warning and situational awareness, but it also poses new challenges for text mining.

Scholars in China and abroad have done some work on mining and analyzing microblog text streams and have achieved certain results. Nevertheless, the sheer volume and rapid growth of microblog data require new scientific methodologies. Cloud computing, a future trend, can efficiently handle the storage and computation of massive data, so combining cloud computing technology with microblog mining is imperative.

This thesis designs a complete event detection and tracking model using a cloud framework. The main contents and innovations of the thesis are as follows:

① Microblog texts are filtered according to mechanized filtering rules formulated in advance, which improves the efficiency of subsequent processing.

② A new term-weighting approach is proposed, named Forward Comment Frequency-Dynamic Inverse Document Frequency (FCF-DIDF), based on the traditional Term Frequency-Inverse Document Frequency (TF-IDF). FCF-DIDF takes the growing scale of the microblog text set into account, improves on the performance of TF-IDF, and is suitable for processing short texts.

③ An Incremental Hierarchical DBSCAN based on representative posts (RIHDBSCAN) is proposed, built on DBSCAN. The RIHDBSCAN algorithm is divided into three steps: generate initial clusters, merge initial clusters, and choose representative posts. During the execution of RIHDBSCAN, only some of the objects need the core-object judgment, which greatly reduces I/O overhead and shields the result from the influence of data input order. At the end of each detection round, a set of representative posts is selected to join the incremental clustering. Events are tracked through the changes in cluster structure and keywords from round to round.

④ Because a single node cannot process the sheer amount of microblog text quickly and efficiently, the algorithm is deployed on the Hadoop platform. The four parts of the model (microblog text filtering, FCF-DIDF dynamic weight calculation, cosine distance calculation, and RIHDBSCAN clustering) are parallelized in the MapReduce framework.

Experiments on real-world microblog data extracted from the Sina Microblog platform show that FCF-DIDF achieves higher performance than TF-IDF, UF-ITUF, and others, and that the use of the cloud framework obtains reasonably good performance. The approach is suitable for data analysis and mining on huge datasets.
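To make the pipeline above more concrete, the following is a minimal single-machine sketch of the baseline building blocks the abstract names: term weighting, cosine distance, and density-based clustering with a core-object test. It is not the thesis method. The abstract does not give the FCF-DIDF formula or the RIHDBSCAN merge and representative-post steps, so the sketch substitutes plain TF-IDF (with a smoothed IDF, an assumption) and textbook DBSCAN. All function names, parameter values, and the toy posts are hypothetical.

```python
import math
from collections import Counter

NOISE = -1


def tfidf_vectors(docs):
    """Baseline TF-IDF weights for tokenized posts (a stand-in for FCF-DIDF)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps, min_pts):
    """Textbook DBSCAN; returns one cluster label per vector, NOISE for outliers."""
    n = len(vectors)
    labels = [None] * n

    def neighbors(i):
        # Brute-force eps-neighborhood; the point itself is included.
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:  # core-object judgment failed
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core object is a border object
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is itself a core object, keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Hypothetical toy posts, already tokenized.
posts = [
    ["earthquake", "sichuan", "rescue"],
    ["earthquake", "sichuan", "relief"],
    ["earthquake", "rescue", "relief"],
    ["concert", "tickets", "beijing"],
    ["concert", "tickets", "sale"],
    ["concert", "beijing", "sale"],
    ["lunch", "noodles"],
]
vectors = tfidf_vectors(posts)
labels = dbscan(vectors, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In this sketch every point is tested as a potential core object and all pairwise distances are recomputed on demand. That is the cost the abstract says RIHDBSCAN reduces by testing only part of the objects and carrying representative posts into the next round.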
Keywords/Search Tags: Microblog, event detection, DBSCAN, Cloud computing, representative posts