Replica Selection Based Scheduling For Big Data Real-Time Query Processing

Posted on: 2016-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Zhao
Full Text: PDF
GTID: 2308330470467738
Subject: Computer application technology
Abstract/Summary:
Cloudera Impala is an open-source system for real-time queries over big data. Impala uses HDFS as its underlying storage manager. Files in HDFS are split into block-sized chunks, and each chunk is replicated. Data replication provides fault tolerance and load balancing, but it also complicates scheduling. Impala's scheduling consists of two steps: replica selection and execution node selection. During replica selection, Impala considers neither the cost of network transmission nor the load of the cluster, which may increase response time. The current scheduling method does not take data replication into account. To solve this problem, we propose a replica selection based scheduling method. The method first divides queries into two categories: those that retrieve data from a single table and those that retrieve data from multiple tables. For a single-table query, the method constructs a flow network according to the data distribution, uses the SRPushRelabelBinary algorithm to select replicas, and then selects the execution nodes. For a multi-table query, the method uses a cost model to search for a near-optimal scheduling strategy. The cost of query processing is defined as the interval between the start of query execution and the estimated time at which all join operators finish. It comprises the execution times of the scan, select, exchange, and join operators. The cost model takes into account communication cost, parallel execution, and the load of the cluster. The MaxDiff(V, A) histogram is used to estimate the size of intermediate results and thereby improve the accuracy of the cost model. The proposed method was implemented in Impala 2.0, and experiments on queries from the TPC-DS benchmark indicate that it can reduce response time by 10%-30%.
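The single-table case can be illustrated with a small flow-network sketch. The abstract does not describe SRPushRelabelBinary in detail, so the following is only an assumption about the general idea: each block sends one unit of flow to one of the nodes holding a replica, each node accepts at most a fixed number of blocks, and a binary search finds the smallest such limit for which every block can still be assigned. For brevity the sketch swaps in a plain Edmonds-Karp max-flow rather than push-relabel, and all names (select_replicas, the block and node labels) are hypothetical.

```python
from collections import deque


def build_network(replicas, per_node_limit):
    """Build a residual graph: source -> block -> replica node -> sink."""
    graph = {"source": {}, "sink": {}}

    def add_edge(u, v, cap):
        graph.setdefault(u, {})[v] = cap
        graph.setdefault(v, {}).setdefault(u, 0)  # reverse edge for the residual graph

    nodes = set()
    for block, holders in replicas.items():
        add_edge("source", ("block", block), 1)  # each block is read exactly once
        for node in holders:
            add_edge(("block", block), ("node", node), 1)
            nodes.add(node)
    for node in nodes:
        add_edge(("node", node), "sink", per_node_limit)  # cap on blocks per node
    return graph


def max_flow(graph, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest paths (mutates graph)."""
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in graph[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Walk back from the sink to collect the augmenting path.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(graph[u][w] for u, w in path)
        for u, w in path:
            graph[u][w] -= bottleneck
            graph[w][u] += bottleneck
        total += bottleneck


def select_replicas(replicas):
    """Binary-search the smallest per-node limit that still assigns every block.

    Assumes every block has at least one replica.
    """
    if not replicas:
        return 0, {}
    lo, hi, best = 1, len(replicas), None
    while lo <= hi:
        mid = (lo + hi) // 2
        graph = build_network(replicas, mid)
        if max_flow(graph, "source", "sink") == len(replicas):
            best, hi = (mid, graph), mid - 1
        else:
            lo = mid + 1
    limit, graph = best
    # A saturated block -> node edge (residual capacity 0) marks the chosen replica.
    assignment = {
        block: node
        for block, holders in replicas.items()
        for node in holders
        if graph[("block", block)][("node", node)] == 0
    }
    return limit, assignment


# Hypothetical data distribution: block id -> nodes holding a replica of it.
example = {"b1": ["n1", "n2"], "b2": ["n1", "n2"], "b3": ["n1"], "b4": ["n1", "n3"]}
limit, assignment = select_replicas(example)
print(limit, assignment)
```

This sketch balances only the number of blocks per node. It does not model network transmission cost or current cluster load, which the proposed method is stated to consider.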
Keywords/Search Tags: replica selection, parallel processing, schedule, cost model, big data real-time query