Key Technologies And Applications On Real-time Streaming Data Analysis

Posted on: 2016-08-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Y Yang
GTID: 1108330503993767
Subject: Computer Science and Technology
Abstract/Summary:
In the age of the booming Web, massive data in various formats and types can be acquired rapidly. However, conventional data processing systems can neither handle huge amounts of data cost-effectively nor deliver accurate and timely query results. Consequently, processing Internet-scale streaming data quickly and sensibly poses a grand challenge to traditional computing systems, while also creating opportunities to showcase the latest advances in distributed computing paradigms.

In this thesis, we leverage a distributed stream processing system to cope with streaming data. The distributed processing scheme runs in cluster mode and yields results rapidly, enabling real-time analysis and decision-making. Specifically, we concentrate on three tasks over streaming data:
(1) We develop an online aggregation algorithm over a distributed stream system. It handles large-scale streaming data quickly and with high concurrency. Real-time aggregation results and statistical analysis are continuously sent to users.
(2) We develop a novel long-term prediction approach over streaming data. It predicts future trends from the current pattern.
(3) We build CANDS, a distributed stream processing platform for continuous optimal shortest path queries. It responds quickly to traffic updates to guarantee the optimality of its shortest path answers.

The details of our work are as follows:

1. Online aggregation, which extends the batch model of traditional query processing, progressively refines the aggregation result as more data are processed. In addition, the corresponding statistical analysis can be displayed to users continuously. Nonetheless, existing distributed platforms such as MapReduce raise some critical issues.
MapReduce struggles to keep up with ongoing real-time data events, because the intermediate results of a MapReduce job must be stored on disk. Moreover, estimates cannot be provided in real time. Another problem is that it is difficult to generate random samples from heterogeneous data streams, which online aggregation depends on. To handle these problems, we propose an online aggregation algorithm over distributed streaming data. First, we adopt a distributed weighted random sampling algorithm to correct for biased distribution across heterogeneous streams. Then we apply the Actor model to handle complex logic and to process distributed streaming data asynchronously. An incremental processing mechanism supports one-pass processing, which handles data in memory and updates the results quickly to achieve high performance. Furthermore, a multi-level query processing topology is developed to direct stream processing when multiple new queries arrive. Each query is decomposed into a set of independent tasks that communicate with each other by sending and receiving messages. The topology reduces the number of overlapping operations. We have implemented the system and conducted extensive experiments on the TPC-H benchmark, which corroborate the effectiveness and scalability of our model in terms of accuracy and its low response overhead compared with MapReduce.

2. Long-term prediction is an essential yet difficult topic over streaming data. In some emergency systems, long-term prediction is more valuable than short-term prediction, since it allows more time to make preparations and detect anomalous events. However, traditional prediction approaches are not sufficient for long-term prediction.
We therefore introduce a long-term prediction approach via pattern matching over streaming data. By mining similar patterns from historical streaming data, we can predict long-term streaming behavior. We also use a machine learning algorithm, AdaBoost, to address the pattern-length problem, since the optimal pattern length is difficult to determine. Finally, we deploy our approach in a distributed stream system to improve prediction performance. In the implementation, we adopt multiple optimization strategies to enhance efficiency. Empirical studies consistently show that our system achieves pattern matching and long-term prediction rapidly. In addition, it can complete large numbers of online prediction tasks in a short time.

3. Recommending real-time shortest paths for vehicles is a substantial challenge for any modern road network system. Existing solutions rely either on a centralized index system with tremendous pre-computation overhead, or on a distributed graph processing system such as Pregel that requires much synchronization effort. However, the performance of these systems degrades under frequent route updates caused by continuously changing traffic conditions. We therefore build CANDS, a distributed stream processing platform for continuous optimal shortest path queries. CANDS answers a large number of shortest path queries asynchronously, so that it can efficiently detect affected paths and adjust them in response to traffic updates. Moreover, the affected paths can be quickly updated to the optimal solutions throughout the whole navigation process. Experimental results demonstrate that the performance of CANDS in answering shortest path queries is two orders of magnitude better than that of GPS, an open-source implementation of Pregel. In addition, CANDS responds quickly to traffic updates to guarantee the optimality of its shortest path answers.
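
As a rough illustration of the incremental, one-pass refinement described for online aggregation in part 1, the following Python sketch maintains a running average together with an approximate 95% confidence half-width. It is not the algorithm of the thesis: the class name RunningAverage, the use of Welford's update, and the normal-approximation interval are assumptions made for illustration. The sketch also assumes that tuples arrive as a uniform random sample, which the thesis obtains through distributed weighted random sampling.

```python
import math
import random


class RunningAverage:
    """One-pass estimate of a mean with an approximate confidence interval.

    Hypothetical helper for illustration; not taken from the thesis.
    """

    def __init__(self):
        self.n = 0       # number of tuples seen so far
        self.mean = 0.0  # current estimate of the average
        self.m2 = 0.0    # running sum of squared deviations (Welford's method)

    def update(self, x):
        """Fold one new tuple into the estimate without storing it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z=1.96):
        """Half-width of the normal-approximation interval (infinite until n >= 2)."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)  # sample variance
        return z * math.sqrt(variance / self.n)


def demo():
    """Stream synthetic values and report the estimate at a few checkpoints."""
    rng = random.Random(0)
    agg = RunningAverage()
    for i in range(1, 10001):
        agg.update(rng.uniform(0, 100))
        if i in (100, 1000, 10000):
            print(f"after {i} tuples: avg = {agg.mean:.2f} +/- {agg.half_width():.2f}")


if __name__ == "__main__":
    demo()
```

Each call to update touches only three numbers, so an estimate can be reported after every tuple, and the half-width shrinks roughly in proportion to 1/sqrt(n) as more data arrive.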
Keywords/Search Tags: Distributed Stream System, Real-time Processing, Online Aggregation, Pattern Matching, Long-term Prediction, Shortest Path