Font Size: a A A

A Study Of Frequency Queries Of Data Streams Based On Sketch

Posted on:2024-05-15Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y Q LiuFull Text:PDF
GTID:1528306932457484Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Since entering the era of big data,how to use limited memory resources to perform frequency estimation in streaming data items with high arrival rates and massive scale has become the focus in many applications.For example,network measurement requires fast frequency statistics and analysis of massive packets on network devices such as switches to achieve task goals such as network anomaly detection,congestion regulation,and load balancing.And with the rapid growth of network traffic,how to perform real-time and more accurate frequency estimation of streaming data in such resource-constrained network devices is an important research topic.Compared with other techniques,sketch,as a probabilistic data structure with a high compression ratio,is widely used for streaming frequency estimation because of its small space occupation,fast update speed,and high estimation accuracy.Traditional sketch uses a hash function to map data items into a compact data structure consisting of counters.However,hash conflicts can cause multiple different data items to be mapped into the same counter,which in turn causes estimation errors in sketch.Moreover,as the number of data items in the data stream increases,the data conflicts in the counters continue to accumulate,which leads to larger estimation errors.Therefore,current sketch techniques are still challenging in handling large-scale data streams.To alleviate these problems,this paper proposes XY-sketch.It estimates the frequency values by estimating the probability of the data items to appear in the data stream.The single directionality of this probability estimation allows XY-sketch to always estimate only the frequency values of the target data items,and thus alleviates the above problems by maintaining a high estimation accuracy in the face of a surge in data volume.The data stream frequency queries studied in this paper include:Point queries,Top-k queries,Heavy Change queries,and Top-k range queries.The main contents and contributions are as follows:1.This paper proposes a novel structure for point queries in data streams,called the XY-sketch,which estimates the frequency of a stream item by estimating the probability of this item appearing in the data stream.The framework associated with the XY-sketch consists of two phases,namely decomposition and recomposition phases.A data item is split into a set of compactly stored basic elements by a bijective function,which can be recomposed in a probabilistic manner for query evaluation during the recomposition phase.Meanwhile,the error bounds of XY-sketch are analyzed and proved,and two aspects of optimization are performed:spatial optimization and bijective function optimization.The experimental results show that XY-sketch has high estimation accuracy,especially when the space is small,and the estimation error of XY-sketch is one order of magnitude smaller than the existing state-of-the-art.2.This paper extends XY-sketch to support Top-k queries in data streams.Traditional methods based on sketch have two core steps in implementing Top-k queries:identifying hot items by a sketch and storing the identified hot items precisely with an additional auxiliary structure.The sketch part often occupies a large storage space to ensure the high accuracy of the queries.The proposed method in this paper can identify hot items more accurately even with a small space budget,which improves space utilization and overall estimation accuracy.Meanwhile,this paper analyzes and proves the error bound of the proposed method,and designs a series of experiments to evaluate the performance.The experimental results show that the method has good estimation accuracy.3.This paper extends XY-sketch to support heavy change queries in data streams.The characteristic of culprit items is that their frequencies are high in at least one window.Therefore,some state-of-the-art only store the hot items in the data stream and exclude the cold items.Thus,some data items are misreported as culprit items due to the overestimation of the frequency difference,which reduces the estimation accuracy.In this paper,the proposed method can estimate the frequencies of cold items accurately even with a small space budget,thus improving the accuracy in calculating the frequency differences,leading to improve overall accuracy.In this paper,the error bound of the proposed method is proved.Meanwhile,the experimental results show that the method has good estimation accuracy.4.This paper proposes a novel PE(Probe-and-Explore)framework to implement Top-k range queries in data streams.The PE framework is based on the sketch to obtain approximate estimation results quickly by limited resources.It is divided into two phases:the probing phase and the exploration phase.In the probing phase,it identifies hot dyadic subranges in the data stream to approximate a hot range,thus narrowing down the scope of the exploration phase.In the exploration phase,it retrieves the neighbors of those explored hot dyadic subranges to expand them into complete hot ranges in response to the queries.Following the framework,we devise two structures,multi-level accumulated counter(MLAC)and multi-level decayed counter(MLDC).The former extends sketch-based solutions on dyadic range queries to support the queries.The latter further optimizes the performance by applying the exponential decay strategy to eliminate the space overhead on cold dyadic ranges.Meanwhile,we analyze and prove the error of the proposed method.Experimental results show that the precision of MLAC can reach 80%or more,while the precision of MLDC can reach 90%or more.In this paper,XY-sketch is designed to implement a point queries for data streams,which estimates the frequency value of a data item by estimating its probability of occurrence in the data stream.This allows XY-sketch to maintain high estimation accuracy even when the storage space is small or the data volume is surging.Meanwhile,I further extend XY-sketch to support Top-k queries and heavy change queries,and propose a PE framework based on sketch technology to implement Top-k range queries for data streams.The experimental results show that each method in this paper has good estimation accuracy.
Keywords/Search Tags:Data Streams, Frequency Queries, Sketch, Probabilistic Data Structures
PDF Full Text Request
Related items