
Adaptive Distributed Data Stream Management System

Posted on: 2011-08-30    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Mahmoud Sami Soliman    Full Text: PDF
GTID: 1118330335488973    Subject: Computer application technology
Abstract/Summary:
Sensor networks are wireless networks used for environment monitoring, target tracking, structural health monitoring, precision agriculture, active volcano monitoring, transportation, human activity monitoring, and other monitoring applications. Sensor data behave very differently from traditional database sources: they arrive continuously in multiple, rapid, time-varying, possibly unpredictable and unbounded streams, and no record of historical information is kept. These characteristics make conventional Database Management Systems and their extensions unsuitable for streams, creating the need for a complete Data Stream Management System (DSMS) that can process streams and perform dynamic continuous query processing.

This dissertation presents a framework for an Adaptive Distributed Data Stream Management System (ADDSMS) that operates as a stream control interface between arrays of distributed data stream sources and the end-user clients who access and analyze these streams. The framework provides stream management and query processing mechanisms to support the online acquisition, management, processing, storage, and integration of data streams for distributed sensor networks. The rise of large-scale monitoring infrastructures such as wireless sensor networks poses distributed query processing challenges: queries must be processed inside the system in a distributed fashion so that the performance of the typically resource-constrained processing nodes is maximized. Furthermore, to achieve high throughput and low latency in the presence of high-rate data streams, the query processing operators must be placed adaptively across the network to minimize the cost of data movement. Three optimization levels are proposed to provide the maximum reduction in the resources required to process data streams; the gains come from using these levels and methods simultaneously.

The first optimization level is sensor deployment. Sensor deployment is a critical issue because it affects the cost, the amount of data that must be processed, and the detection capabilities of a wireless sensor network. Although many previous efforts have addressed this issue, most of them assume that the sensing field is an open space; in this work, the sensing field is treated as a set of constrained regions. The sensor location problem (SLP) is a nonlinear, non-convex programming problem that aims to place sensors so that they monitor a constrained region, with the objective of determining the locations that maximize coverage. Three evolutionary algorithms, particle swarm optimization (PSO), a genetic algorithm (GA), and adaptive hybrid optimization (AHO), were used to solve the SLP. AHO uses a fuzzy logic controller (FLC) as an intelligent switching agent between the different optimization techniques. Several variants (sensing patterns, numbers of sensors, and region constraints) were tested and compared in terms of percentage coverage and computation cost. The results show that all three algorithms obtain good solutions. The proposed AHO achieves the benefits of both GA and PSO and avoids their drawbacks by switching between them intelligently during the optimization process, and it reaches the best percentage coverage within fewer iterations in all experiments.
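The sketch below illustrates, in simplified form, the hybrid idea behind AHO: a population of candidate sensor layouts is improved with either PSO-style or GA-style updates, and a simple improvement-based rule, standing in for the dissertation's fuzzy logic controller, decides when to switch between them. The grid size, sensing radius, forbidden region, and switching threshold are illustrative assumptions, not values from the experiments.

```python
import math
import random

GRID, RADIUS, N_SENSORS, POP, ITERS = 30, 4.0, 10, 20, 60

def in_region(x, y):
    """Constrained sensing field: a square with one forbidden block (assumed)."""
    return not (10 <= x <= 15 and 10 <= y <= 15)

def coverage(layout):
    """Fraction of valid grid cells within sensing radius of some sensor."""
    covered = total = 0
    for gx in range(GRID):
        for gy in range(GRID):
            if not in_region(gx, gy):
                continue
            total += 1
            if any(math.hypot(gx - x, gy - y) <= RADIUS for x, y in layout):
                covered += 1
    return covered / total

def random_layout():
    pts = []
    while len(pts) < N_SENSORS:
        x, y = random.uniform(0, GRID), random.uniform(0, GRID)
        if in_region(x, y):
            pts.append((x, y))
    return pts

def pso_step(pop, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """Standard PSO velocity/position update, clamped to the sensing field."""
    for i, layout in enumerate(pop):
        new_layout, new_vel = [], []
        for j, (x, y) in enumerate(layout):
            vx, vy = vel[i][j]
            vx = w * vx + c1 * random.random() * (pbest[i][j][0] - x) \
                        + c2 * random.random() * (gbest[j][0] - x)
            vy = w * vy + c1 * random.random() * (pbest[i][j][1] - y) \
                        + c2 * random.random() * (gbest[j][1] - y)
            nx, ny = min(max(x + vx, 0), GRID), min(max(y + vy, 0), GRID)
            new_layout.append((nx, ny) if in_region(nx, ny) else (x, y))
            new_vel.append((vx, vy))
        pop[i], vel[i] = new_layout, new_vel

def ga_step(pop, fitness):
    """Elitist GA step: keep the best half, refill with crossover + mutation."""
    ranked = sorted(range(len(pop)), key=lambda i: -fitness[i])
    elite = [pop[i] for i in ranked[: len(pop) // 2]]
    children = []
    while len(elite) + len(children) < len(pop):
        a, b = random.sample(elite, 2)
        cut = random.randint(1, N_SENSORS - 1)
        child = a[:cut] + b[cut:]
        if random.random() < 0.2:
            child[random.randrange(N_SENSORS)] = random_layout()[0]  # mutation
        children.append(child)
    pop[:] = elite + children

pop = [random_layout() for _ in range(POP)]
vel = [[(0.0, 0.0)] * N_SENSORS for _ in range(POP)]
pbest, gbest, best_fit, mode, stall = list(pop), pop[0], 0.0, "pso", 0
for _ in range(ITERS):
    fit = [coverage(p) for p in pop]
    for i, f in enumerate(fit):
        if f > coverage(pbest[i]):
            pbest[i] = pop[i]
    if max(fit) > best_fit:
        best_fit, gbest, stall = max(fit), pop[fit.index(max(fit))], 0
    else:
        stall += 1
    if stall >= 5:               # crude stand-in for the FLC switching decision
        mode, stall = ("ga" if mode == "pso" else "pso"), 0
    pso_step(pop, vel, pbest, gbest) if mode == "pso" else ga_step(pop, fit)
print("best coverage:", round(best_fit, 3))
```

In the actual AHO, the switching decision is made by a fuzzy logic controller over richer state than the simple stall counter used here; the sketch only shows how GA- and PSO-style updates can share one population.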
The second optimization level is continuous query distribution. A continuous query over a data stream can be represented as a data-flow graph. To run multiple continuous queries in a DSMS, the common approach is to unify their data-flow graphs into a single query plan; a query execution engine executes the plan by carrying out the steps it represents and delivering a set of results for each query. The nodes of the query plan are sources (data streams), operators (e.g., selection, join), and sinks (user applications), and the edges between them represent the data flow. Such a plan can be represented as a directed acyclic graph with a one-to-one mapping between vertices and operators and between edges and data flows. The performance of the ADDSMS depends directly on a well load-balanced distribution of the queries among the available computing devices and on minimizing the impact of the communication between them.

In this dissertation, a cost model is used for the mapping between the query plan and the directed acyclic graph. The model is built on the semantics and implementation of continuous sliding-window queries in the Public Infrastructure for Processing and Exploring Streams (PIPES), an infrastructure that provides all the fundamental building blocks needed to implement a DSMS. The cost model provides a cost formula for each individual operator: the output stream characteristics of an operator are computed by applying the corresponding formula to its input stream characteristics. For any query, the model computes all intermediate stream characteristics by starting at the sources and applying the cost formulas bottom-up through the query plan. The resulting graph carries each operator's cost and the communication cost between successive operators.

A novel ant-based algorithm called the Similarity Carrying Ant Model (SCAM-ant) is proposed for partitioning the continuous query graph. The first ant-based clustering algorithm was introduced by Deneubourg et al.; Kuntz, Layzell, and Snyers (KLS) later proposed using ant clustering for graph partitioning. The KLS algorithm treats graph partitioning as an embedding into a Euclidean space or plane: the ant-based clustering model is essentially a dynamical system in which vertices are moved in the plane by being picked up and dropped by ants, and vertices are attracted to or rejected by clusters according to the distance between them. Under SCAM-ant, each ant has some probability of carrying additional items as it moves around, and the decision is governed by the similarity between the items already carried and the candidate item. By moving groups of items rather than the single items of the original ant clustering algorithm, SCAM-ant reduces the time needed for the complete clustering process.

To provide an appropriate similarity measure between graph vertices, a modified SimRank is proposed. SimRank is a general algorithm that measures only the similarity of structural context; it applies to any domain with enough relevant relationships between objects to base at least some notion of similarity on those relationships. Here, SimRank is combined with the edge weights to obtain the final similarity measurement, which reflects the relationships among operators based on both the query structure and the data flow between operators.
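As a rough illustration of the cost model's bottom-up propagation, the sketch below reduces each operator to a selectivity and a per-tuple CPU cost and pushes stream rates from the sources toward the sink, yielding vertex weights (processing load) and edge weights (data shipped between successive operators). The operator names, rates, selectivities, and the simple rate-combination rule are assumptions for illustration; the actual PIPES cost formulas are considerably richer.

```python
# Hypothetical query plan: operator -> (inputs, selectivity, cpu cost per tuple)
plan = {
    "src1": ([], 1.0, 0.0),
    "src2": ([], 1.0, 0.0),
    "sel1": (["src1"], 0.4, 1.0),
    "join": (["sel1", "src2"], 0.05, 5.0),
    "sink": (["join"], 1.0, 0.5),
}
source_rate = {"src1": 1000.0, "src2": 800.0}   # tuples per second (assumed)

def propagate(plan, source_rate):
    """Apply per-operator formulas bottom-up, from sources toward the sinks."""
    out_rate, op_cost, edge_cost = {}, {}, {}
    def rate(op):
        if op in out_rate:
            return out_rate[op]
        inputs, sel, cpu = plan[op]
        if not inputs:                          # source: rate is given
            r_in = source_rate[op]
        else:                                   # crude combination of input rates
            r_in = sum(rate(i) for i in inputs)
        out_rate[op] = r_in * sel
        op_cost[op] = r_in * cpu                # processing load on the node
        for i in inputs:
            edge_cost[(i, op)] = out_rate[i]    # tuples shipped per second
        return out_rate[op]
    for op in plan:
        rate(op)
    return op_cost, edge_cost

op_cost, edge_cost = propagate(plan, source_rate)
print(op_cost)      # vertex weights for the partitioning step
print(edge_cost)    # edge weights (data flow between successive operators)
```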
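A minimal sketch of the combined similarity measure and the SCAM-ant carrying decision follows, assuming a small operator graph, a fixed blending factor `alpha` between structural SimRank and normalized edge weight, and a simple probability rule for picking up an additional item; all of these specifics are illustrative assumptions rather than the dissertation's exact formulas.

```python
import itertools

# (u, v) -> data-flow weight between operators (assumed values)
edges = {
    ("src1", "sel1"): 1000.0,
    ("sel1", "join"): 400.0,
    ("src2", "join"): 800.0,
    ("join", "sink"): 60.0,
}
nodes = sorted({n for e in edges for n in e})
preds = {n: [u for (u, v) in edges if v == n] for n in nodes}

def simrank(C=0.8, iters=10):
    """Plain SimRank: similarity of structural context over in-neighbors."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = dict(sim)
        for a, b in itertools.product(nodes, nodes):
            if a == b or not preds[a] or not preds[b]:
                continue
            total = sum(sim[(i, j)] for i in preds[a] for j in preds[b])
            new[(a, b)] = C * total / (len(preds[a]) * len(preds[b]))
        sim = new
    return sim

def combined_similarity(a, b, sim, alpha=0.5):
    """Blend structural similarity with the normalized data-flow edge weight."""
    w = edges.get((a, b), 0.0) + edges.get((b, a), 0.0)
    return alpha * sim[(a, b)] + (1 - alpha) * w / max(edges.values())

def carry_probability(carried, candidate, sim, k=2.0):
    """SCAM-ant style rule: an ant already carrying some operators picks up
    another one with a probability that grows with the average similarity
    between the candidate and the carried items (illustrative form)."""
    if not carried:
        return 1.0
    avg = sum(combined_similarity(c, candidate, sim) for c in carried) / len(carried)
    return min(1.0, k * avg)

sim = simrank()
print(round(combined_similarity("sel1", "join", sim), 3))
print(round(carry_probability(["src1", "sel1"], "join", sim), 3))
```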
A universal template structure is proposed to be merged with SCAM-ant. With this mechanism, the final structure of the clusters closely follows the configuration defined by the templates: the centroid of each cluster in the template is defined by the user's requirements rather than by the feature space of the data (the operators to be clustered).

To assess the quality of the proposed continuous query distribution algorithm, two metrics are used: communication cost and imbalance. The total communication cost between the different clusters reflects the quality of the distribution strategy, while the imbalance metric reflects how evenly the workload is spread across the processing nodes. The experimental results show clearly that the SCAM model greatly improves on the original KLS: both metrics confirm that clustering with the proposed model achieves better clusters in less time, because SCAM reduces the number of tours the ants need to transport items. The experiments also show that SCAM-ant produces well-balanced distributed query plans with minimal communication between them. Compared with spectral, linear, scattered, and multilevel-KL partitioning, the SCAM-ant algorithm is more stable at maintaining the balance condition as the number of partitions increases.

The third optimization level is the scheduling of query operators. Previous approaches to scheduling queries and their operators (e.g., selection, join) assume either that each operator runs in a separate thread (multi-threaded) or that all operators are combined into one query plan and run in a single thread, as in the Chain algorithm [124]. Both approaches suffer from severe drawbacks: thread overhead in the former and stalls caused by expensive operators in the latter. The original Chain algorithm, commonly used as the scheduler in several DSMS projects, was developed with the sole focus of minimizing the maximum run-time memory usage, ignoring the important aspect of output latency; during bursts in the input streams, Chain suffers from tuple starvation and therefore incurs high latency for those tuples. To overcome these drawbacks, a novel approach called clustered operator scheduling (COS) is proposed. COS adaptively clusters the operators of the query plan into a number of groups based on their selectivities and computation costs using a clustering method called S-mean. S-mean is a similarity-driven clustering method similar to K-means, with one key difference: a point whose highest similarity to the existing centroids falls below a given threshold is assigned to a new cluster, whereas in K-means every point must go to one of the existing K groups, which is unfair to points whose similarity to their closest centroid is very low. This simple difference has a considerable impact on the resulting clusters, and leaving the number of clusters k unspecified provides a high degree of adaptation, letting the clustering follow the nature of the data.
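The grouping step at the heart of COS can be sketched as follows, assuming each operator is described by a (selectivity, per-tuple cost) pair and an inverse-distance similarity with a fixed threshold; the feature values and the threshold are illustrative assumptions. Unlike K-means, a point that is not similar enough to any existing centroid seeds a new cluster, so k emerges from the data.

```python
import math

# op -> (selectivity, per-tuple cost); hypothetical values for illustration
operators = {
    "sel1": (0.4, 1.0), "sel2": (0.5, 1.2), "proj": (1.0, 0.3),
    "join": (0.05, 5.0), "agg": (0.1, 4.5),
}

def similarity(a, b):
    """Inverse-distance similarity in the (selectivity, cost) feature space."""
    return 1.0 / (1.0 + math.dist(a, b))

def s_mean(points, threshold=0.6, iters=10):
    centroids = [next(iter(points.values()))]          # seed with the first operator
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for name, p in points.items():
            sims = [similarity(p, c) for c in centroids]
            best = max(range(len(centroids)), key=lambda i: sims[i])
            if sims[best] >= threshold:
                clusters[best].append(name)
            else:                                       # too dissimilar: new cluster
                centroids.append(p)
                clusters.append([name])
        # recompute centroids as the mean of each non-empty cluster
        centroids = [
            tuple(sum(points[n][d] for n in c) / len(c) for d in range(2))
            for c in clusters if c
        ]
    return [c for c in clusters if c]

print(s_mean(operators))   # e.g. selections, cheap projection, heavy join/agg
```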
A simulation model of the data processing unit was built using Discrete Event Simulation (DES) to compare different scheduling algorithms. COS was compared with the conventional scheduling methods FIFO, Chain, and multi-threaded scheduling, and it delivered the best performance in all situations. COS also scales well and is robust: it uses memory and computation resources efficiently enough to keep working under limited resources, where the other techniques lose their stability. The experimental evaluation demonstrates the potential benefits of COS as an adaptive, flexible, reliable, scalable, and robust scheduling technique for a continuous query processor.

As a final contribution, the ADDSMS is a data manager that can be used to process data streams with adaptive capabilities; each optimization level provides some degree of flexibility based on the data stream characteristics and the users' queries.
Keywords/Search Tags: DSMS, Continuous query, PSO, GA, Hybrid algorithm, Sensor deployment, Distributed processing, Load balancing, Operator scheduling, Clustering