Research On Big Data Platform Architecture And Application

Posted on: 2018-02-04
Degree: Master
Type: Thesis
Country: China
Candidate: D C Zhang
Full Text: PDF
GTID: 2348330536979646
Subject: Software engineering
Abstract/Summary:
In recent years, the development and application of cloud computing, the Internet of Things, the mobile Internet, and other new technologies have brought about the era of big data. Since the concept of big data emerged, a growing number of large-scale distributed data processing frameworks have appeared in industry, from the earliest Hadoop MapReduce to Spark and Storm, each suited to different data processing methods and business scenarios. Although the research and application of offline batch analysis of big data is now quite mature, more and more fields require real-time analysis of, and rapid response to, fast and massive stream data. Online cardinality estimation has important application value for data stream query optimization, network traffic monitoring, network security, data compression, and similar tasks. Several probabilistic algorithms have been developed to estimate the number of distinct elements in a dataset by scanning static historical data, with an acceptably low standard error. However, because a data stream is unbounded, fast, and real-time, these algorithms cannot be applied directly to an infinite data stream. To address this problem, this thesis studies the Storm and Spark Streaming distributed stream computing systems together with existing probabilistic algorithms, and designs an Improved HyperLogLog Algorithm of Re-Count in Stream Data Based on Storm and Spark Streaming. The main work is as follows.

First, this thesis studies the typical batch processing framework Hadoop, the in-memory computing framework Spark, and the stream processing framework Storm, covering platform architecture, data computation model, and framework security, and summarizes the similarities and differences among the three kinds of big data processing techniques. It also analyzes the key technologies of stream data processing and the importance of estimating the number of unique values on a stream platform.

Second, building on the study of big data platforms and traditional estimation algorithms, and on the observation that traditional estimation algorithms cannot be applied to re-counting (distinct counting) of stream data in big data, a stream platform re-count model based on the HyperLogLog algorithm is proposed. The model consumes data from the Kafka system and estimates the number of distinct elements in the Storm and Spark Streaming processing engines. To compute the cardinality of stream data effectively, a sliding window mechanism is added and a parallel implementation of the HyperLogLog algorithm is designed.

Finally, this thesis designs and implements the parallel HyperLogLog algorithm on the stream platforms. The experimental results show that the performance of the parallel HyperLogLog algorithm on the stream platforms is greatly improved while the accuracy of the algorithm is preserved. The implementation on the Storm platform has lower data processing latency than the one on Spark Streaming, but also lower throughput.
Keywords/Search Tags:Big Data Processing, Stream Data, Re-count, Sliding Windows, Paralleling Algorithm
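
The abstract does not give the details of the thesis's algorithm, so the following is only a minimal, single-process sketch of the general idea, under stated assumptions. It uses the standard HyperLogLog estimator with linear counting for small cardinalities, relies on the well-known property that two sketches can be combined by taking the register-wise maximum (which is what makes a parallel implementation possible), and approximates a sliding window by keeping one sketch per time slice ("pane"). The class names `HyperLogLog` and `SlidingWindowDistinctCounter`, the parameters `p` and `panes`, and the pane-based window design are all hypothetical and are not taken from the thesis; Kafka, Storm, and Spark Streaming are deliberately left out.

```python
import hashlib
import math
from collections import deque


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of a SHA-1 digest.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise maximum: the merged sketch equals the sketch that
        # would result from adding both input streams to a single sketch.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


class SlidingWindowDistinctCounter:
    """Hypothetical pane-based window: one sketch per time slice."""

    def __init__(self, panes=3, p=12):
        self.p = p
        # deque with maxlen drops the oldest pane when a new one is appended.
        self.window = deque([HyperLogLog(p)], maxlen=panes)

    def add(self, item):
        self.window[-1].add(item)       # always write to the newest pane

    def advance(self):
        self.window.append(HyperLogLog(self.p))

    def estimate(self):
        total = HyperLogLog(self.p)
        for pane in self.window:
            total = total.merge(pane)
        return total.count()
```

In a distributed setting, each worker could maintain its own sketch over its partition of the stream and a downstream step could call `merge` on the partial sketches. Whether the thesis partitions the work this way is not stated in the abstract.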