
Research On The Implementation Of Bursty Events Detection Based On Spark

Posted on: 2017-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: K Q Zhuo
GTID: 2308330485965733
Subject: Information Science
Abstract/Summary:
Network information flows contain explicit and implicit bursty events, and detecting or predicting such events in these massive flows is extremely important. With early detection or prediction, emergency departments can prepare and work out solutions that minimize losses, and ordinary users can respond calmly, so that the harm of an event is not amplified and unnecessary stress is avoided. Research on bursty events in China and abroad has produced many results. However, current work is largely limited to the theory of bursty characteristics; studies of detection and prediction techniques are few, and techniques for burst detection and prediction in big data environments are scarcer still.

The problem this thesis addresses is how to detect bursty events in network information accurately and quickly in a big data environment. It divides into two sub-problems: how to detect bursty events in network information accurately, and how to detect them quickly.
To address the two sub-problems, this thesis first analyzes the related theory and the main techniques of burst detection, and then studies models and methods for burst detection in a big data environment. The work covers four aspects: (1) it discusses the concepts and technologies related to bursty events, bursty event detection, and big data parallel computing; (2) it introduces the PLSI, LDA and HDP probabilistic generative topic models, with perplexity as the main evaluation index, and analyzes their advantages and disadvantages; (3) it proposes a parallel bursty event detection model designed to detect bursty events accurately and rapidly in a big data environment; (4) it conducts empirical research on two different types of data, Yahoo news and Sina microblog.

The main contribution of this thesis is a parallel bursty event detection model that can be applied to bursty event detection tasks in big data environments. The model consists of four steps: parallel corpus preprocessing, parallel burst word detection, parallel filtering of potentially bursty texts, and parallel LDA topic extraction. The model can be executed in parallel on Spark, a widely used fast data processing platform. Applied to actual business use, the parallel detection model forms a parallel detection system; in this thesis, that system runs on the Spark platform.

In addition, this thesis evaluates the parallel detection model empirically. Experiments on two different types of data sets, Yahoo news and Sina microblog, show that the proposed parallel bursty event detection model has high accuracy and good scalability. The specific experiments are: (1) On the Yahoo news data set, the experiments mainly measure the accuracy of bursty event detection.
Multi-dimensional experiments on the data from April, May and June show that the precision P, recall R and harmonic mean F of the proposed parallel bursty event detection model reach up to 84.62%, 78.57% and 81.48%, respectively. LDA experiments and analysis are also carried out, mainly concerning the perplexity values for different numbers of topics and the topic distribution of words in documents. (2) On the microblog data source, the experiments measure efficiency in terms of speedup and scalability. They then examine the LDA topic extraction module, the most time-consuming module in the parallel detection system, with respect to the number of LDA iterations, the number of LDA topics, the number of Spark partitions, and the hardware platform on which Spark runs.
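
The three accuracy figures reported above can be checked against one another. The following is a minimal sketch, assuming that the "harmonic mean F" is the standard balanced F1 score of precision and recall; the abstract does not state the formula, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F1 score).

    Assumption: the abstract's "harmonic mean F" is this standard F1.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Best precision and recall reported in the abstract, as fractions.
p, r = 0.8462, 0.7857
f = f_measure(p, r)
print(f"F = {f:.2%}")  # prints "F = 81.48%"
```

Under that assumption, the reported F of 81.48% is consistent with the reported P of 84.62% and R of 78.57%.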
Keywords/Search Tags: Bursty event detection, Spark parallel computing, Hadoop MapReduce, Big data analysis, LDA topic model