Research On Real-Time Query Processing In Cloud Computing For Terms In Data Streams

Posted on: 2017-11-05
Degree: Master
Type: Thesis
Country: China
Candidate: S M Zheng
Full Text: PDF
GTID: 2348330503496016
Subject: Engineering
Abstract/Summary:
With the rapid development of Internet technology, more and more application domains have sprung up, such as news, blogs, and social applications. Real-time query processing of terms in data streams is widely applied in many fields, including search engines and social networks. Many existing query processing techniques assume that the term set is known, but under big data conditions the size of the term set is usually unknown. In addition, traditional centralized query processing methods do not consider data partitioning or the merging of partial results, so their precision and performance degrade in a distributed environment. This thesis addresses several common kinds of queries over data streams of terms and proposes new real-time processing techniques with corresponding query algorithms. The main research work and contributions are summarized as follows:

(1) Most algorithms for Top-K term queries in data streams use fixed storage space and find the k most frequent terms on the assumption that the term set is known. Many applications cannot satisfy this assumption. To solve this problem, we present a Top-K term query algorithm based on a dynamic summary under the Spark Streaming framework. The algorithm partitions the data, optimizes the update strategy, and uses little storage space. We also introduce a method for merging query results. With it, Top-K queries over an unknown term set achieve higher precision.

(2) Previous studies of bursty term queries count and store all terms without distinguishing hot terms. As data volumes explode, it makes more sense to determine the bursty times of hot terms only. To solve this problem, we present a distributed bursty term query algorithm based on a numerical discrepancy model. It uses a dynamic update strategy and a checkpoint mechanism to extract hot terms, and then estimates their burstiness scores with a burstiness model. It first finds the bursty times of individual terms, and finally finds the bursty time of a query in linear time from the burstiness of all its query terms.

(3) Based on the Spark Streaming framework, we designed and implemented a system for real-time query processing of terms in data streams. The system supports the Top-K term query and the bursty term query described in this thesis. It can efficiently process, store, and query data by term, and it is scalable to a certain extent. The thesis gives a detailed description of the system's design rationale, its architecture, and the design and implementation of its modules.
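
The abstract does not spell out the dynamic summary or the merge method in contribution (1). As a rough illustration of the general idea (a bounded per-partition summary plus a merge step), the following minimal sketch substitutes the well-known Misra-Gries summary, which is not necessarily the thesis's algorithm. The function names `summarize`, `merge`, and `top_k`, the `capacity` parameter, and the sample terms are all hypothetical, and plain Python lists stand in for Spark Streaming partitions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def summarize(terms: Iterable[str], capacity: int) -> Dict[str, int]:
    """Build a Misra-Gries summary of one partition (assumed stand-in
    for the thesis's dynamic summary). Keeps at most `capacity` counters
    and needs no prior knowledge of the term set."""
    counters: Dict[str, int] = {}
    for term in terms:
        if term in counters:
            counters[term] += 1
        elif len(counters) < capacity:
            counters[term] = 1
        else:
            # Summary is full: decrement every counter, drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def merge(a: Dict[str, int], b: Dict[str, int], capacity: int) -> Dict[str, int]:
    """Merge two partition summaries into one of at most `capacity` counters."""
    merged = Counter(a)
    merged.update(b)  # Counter.update adds counts term by term
    if len(merged) > capacity:
        # Subtract the (capacity + 1)-th largest count; keep positive counters.
        threshold = sorted(merged.values(), reverse=True)[capacity]
        merged = Counter(
            {term: count - threshold
             for term, count in merged.items() if count > threshold}
        )
    return dict(merged)


def top_k(summary: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Return the k terms with the largest estimated counts (ties by name)."""
    return sorted(summary.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical sample data: two partitions of a term stream.
partition_1 = ["spark", "spark", "stream", "spark", "cloud", "stream"]
partition_2 = ["spark", "query", "stream", "spark", "query", "spark"]

merged_summary = merge(
    summarize(partition_1, capacity=3),
    summarize(partition_2, capacity=3),
    capacity=3,
)
print(top_k(merged_summary, k=2))
```

The counts in such a summary are underestimates of the true frequencies, so the ranking is approximate. A larger `capacity` trades storage for precision.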
Keywords/Search Tags:Data Streams, Cloud Computing, Real-Time Processing, Top-K Term Query, Bursty Term Query, Spark Streaming