Sliding Window Top-K Monitoring Over Distributed Data Streams

Posted on: 2018-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Lv
GTID: 2348330512984593
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of the mobile Internet, the Internet of Things, social networking, and other areas, the amount of data has been growing explosively. These areas require massive, rapidly changing data to be processed efficiently and in real time, and real-time processing and analysis of data streams has become a hot topic in big data research. Distributed data stream monitoring is widely studied in application scenarios such as network traffic monitoring, sensor network monitoring, web log usage analysis, and stock market surveillance. These scenarios often require detecting anomalies in the distributed data streams and reporting them promptly. Because large amounts of data are generated rapidly, the traditional centralized approach is no longer feasible: the computing resources and storage capacity of the central processing node are limited, which leads to computation delay and heavy communication overhead.

In this thesis, we study the problem of sliding window top-k monitoring over distributed data streams, that is, continuously querying the k data objects with the largest aggregate numeric values within a fixed-size monitoring window. We adopt the continuous distributed monitoring model, which consists of one coordinator node and a number of distributed monitoring nodes. Each monitoring node continuously receives data records from an input data stream, and the coordinator node is responsible for tracking the global top-k result. To maintain and process the data streams efficiently, we adopt the time-based sliding window processing model. The monitoring window is partitioned into several small window units. Whenever a new window unit is created, the oldest window unit is removed because its data has expired, so the numeric values of data objects change continuously as the window slides. Without further optimization, answering the continuous query requires frequently requesting the changed numeric values from the distributed monitoring nodes to compute the new global top-k result, which incurs huge communication overhead and computation cost across the entire monitoring system.

To reduce the communication overhead in the distributed monitoring system as much as possible, we propose an algorithm, based on revision factors, that reallocates the numeric values of data objects. The algorithm coordinates the numeric values of data objects among the distributed monitoring nodes by assigning revision factors so that the local top-k result at each monitoring node is consistent with the global top-k result. When the local top-k results become inconsistent, the new global top-k result and the revision factors must be recalculated. Communication is therefore needed only occasionally, when local constraints are violated at the monitoring nodes, which greatly reduces communication cost across the network. Extensive experiments conducted on top of Apache Storm demonstrate the efficiency and scalability of our algorithm.
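
The window-unit mechanism described in the abstract can be illustrated with a minimal Python sketch of what a single monitoring node might maintain locally. The class name `SlidingWindowTopK`, its methods, and the parameters `num_units` and `k` are hypothetical names chosen for illustration, and integer values are assumed. The sketch covers only the window maintenance and the local top-k query. It does not implement the revision-factor algorithm, whose details are not given in the abstract.

```python
from collections import Counter, deque
import heapq


class SlidingWindowTopK:
    """Hypothetical per-node state: aggregates over the last `num_units` window units."""

    def __init__(self, num_units, k):
        self.num_units = num_units
        self.k = k
        self.units = deque([Counter()])  # units[-1] is the newest window unit
        self.totals = Counter()          # per-object aggregate over the whole window

    def add(self, obj, value):
        """Record an incoming data record in the newest window unit."""
        self.units[-1][obj] += value
        self.totals[obj] += value

    def slide(self):
        """Open a new window unit and expire the oldest one if the window is full."""
        self.units.append(Counter())
        if len(self.units) > self.num_units:
            expired = self.units.popleft()
            for obj, value in expired.items():
                self.totals[obj] -= value
                if self.totals[obj] == 0:  # exact for integer values (assumed here)
                    del self.totals[obj]

    def top_k(self):
        """Return the k objects with the largest aggregate values in the window."""
        return heapq.nlargest(self.k, self.totals.items(), key=lambda kv: kv[1])


window = SlidingWindowTopK(num_units=2, k=2)
window.add("a", 5)
window.add("b", 3)
window.slide()            # second unit opens; nothing expires yet
window.add("c", 4)
window.add("b", 3)
before = window.top_k()   # window covers both units
window.slide()            # first unit expires
after = window.top_k()
```

Keeping one counter per window unit means expiration only subtracts the oldest unit's contribution instead of rescanning the whole window. In a naive distributed setting, every such change would have to be reported to the coordinator, which is the communication cost the abstract sets out to reduce.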
Keywords/Search Tags: Data Stream, Distributed Monitoring, Top-K Query, Stream Processing