Research On Indexing And Query Over Data Streams

Posted on:2019-10-03

Degree:Master

Type:Thesis

Country:China

Candidate:Z J Lu

GTID:2428330566983443

Subject:Computer Science and Technology

Abstract/Summary:

Massive data streams from Internet of Things (IoT) sensors and from smart devices equipped with the Global Positioning System (GPS) are now flooding into database systems for further processing and analysis. The capability of real-time retrieval over both fresh and historical data is the key enabler for real-world applications in smart manufacturing and smart cities that use these data streams. However, state-of-the-art solutions, e.g. HBase, do not deliver satisfactory performance because of the high overhead of index updates. Time series databases, e.g. Druid, fall short as well: they cannot answer range queries over non-temporal attributes efficiently because they lack secondary range indexes.

In this paper, we present a simple and effective distributed solution that achieves millions of tuple insertions per second and processes ad-hoc temporal range queries in milliseconds. We propose a new data partitioning scheme that takes advantage of the workload characteristics and avoids expensive global data merging. Furthermore, to resolve the throughput bottleneck, we adopt a template-based index method that skips unnecessary index structure adjustments when the distribution of incoming tuples is relatively stable. Our solution fully exploits the limited computation power and network bandwidth by running traditional B+ tree indices over a shared HDFS architecture. Insertion operations only involve reads over intermediate nodes in the tree, which facilitates highly concurrent updates and queries with only minor contention on leaf pages. To parallelize data insertion and query processing, we propose an efficient dispatching mechanism and effective load balancing strategies that utilize computational resources in a workload-aware manner.

We evaluate our prototype system through extensive experiments. First, we evaluate the indexing performance and the effect of data chunk size. Next, we evaluate the adaptivity of our system. Finally, we compare the overall performance with state-of-the-art open-source systems. On both synthetic and real workloads, our system consistently outperforms state-of-the-art open-source systems by at least an order of magnitude. The main reason is the bilayer index architecture. In addition, the template-based B+ tree significantly reduces index maintenance overhead, and the query dispatch algorithm and load balancing make effective use of the computational resources.
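The template-based index idea can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class name `TemplateIndex`, the flat list of separator keys standing in for the inner nodes of a B+ tree, and the `skew` measure are all assumptions made for illustration. The sketch shows only the property stated in the abstract, namely that an insertion reads the fixed inner structure and modifies a single leaf.

```python
import bisect
from typing import List, Tuple


class TemplateIndex:
    """Toy index with a fixed "template" of separator keys (hypothetical).

    The separators stand in for the inner nodes of a B+ tree. Insertions
    only read them; they are never split or rebalanced.
    """

    def __init__(self, separators: List[int]) -> None:
        self.separators = sorted(separators)
        # One leaf per key interval defined by the separators.
        self.leaves: List[List[Tuple[int, str]]] = [
            [] for _ in range(len(self.separators) + 1)
        ]

    def _leaf_for(self, key: int) -> int:
        # Read-only lookup in the template.
        return bisect.bisect_right(self.separators, key)

    def insert(self, key: int, value: str) -> None:
        # Only the target leaf is modified; the template stays untouched.
        bisect.insort(self.leaves[self._leaf_for(key)], (key, value))

    def range_query(self, low: int, high: int) -> List[Tuple[int, str]]:
        # Visit only the leaves whose intervals overlap [low, high].
        first, last = self._leaf_for(low), self._leaf_for(high)
        result: List[Tuple[int, str]] = []
        for leaf in self.leaves[first:last + 1]:
            result.extend(kv for kv in leaf if low <= kv[0] <= high)
        return result

    def skew(self) -> float:
        """Ratio of the fullest leaf to the average leaf size."""
        sizes = [len(leaf) for leaf in self.leaves]
        total = sum(sizes)
        return max(sizes) * len(sizes) / total if total else 1.0


idx = TemplateIndex([10, 20, 30])
for k in [5, 15, 25, 35, 12]:
    idx.insert(k, f"v{k}")
print(idx.range_query(10, 26))
```

In this sketch a rising `skew` value would signal that the key distribution has drifted away from the template, at which point a real system would presumably rebuild the separators. How and when the thesis's system does so is not described in the abstract.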

Keywords/Search Tags:

data streams, index, query processing, distributed system, template-based B+ tree

Related items

1	The Processing Strategy For Data Streams Based On Sliding Window In Simulation Platform
2	Research On Keyword Query Approach Over RDF Data Based On Tree Template
3	Research On K-Skyband Based Top-K Dominating Query Over Distributed Data Streams
4	Research On Distributed Parallel N-of-N Skyline Query Processing Technology Over Data Streams
5	Query Processing Techniques For Large-scale Product Knowledge Graphs
6	Research On Techniques And Systems For Index And Query Optimization Of Big Data
7	Research On Distributed Parallel Skyline Query Processing Technology Over Uncertain Data Streams
8	Adaptive Processing Of Ad-hoc Queries On Data Streams
9	Real-time Entity Resolution And Query Processing Based On Region-tree Indexing
10	Research On Key Technologies Of Distributed Rank-aware Query Processing