Replica Selection Based Scheduling For Big Data Real-Time Query Processing

Posted on:2016-08-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Zhao

Full Text:PDF

GTID:2308330470467738

Subject:Computer application technology

Abstract/Summary:

Cloudera Impala is an open-source real-time query system for big data that uses HDFS as its underlying storage manager. Files in HDFS are broken into block-sized chunks, and each chunk is replicated. Data replication provides fault tolerance and load balancing, but it also makes scheduling more complicated. Impala's scheduling consists of two steps: replica selection and execution node selection. During replica selection, Impala considers neither the cost of network transmission nor the load of the cluster, which can increase response time. The current scheduling method does not take data replication into account. To solve this problem, we propose a replica selection based scheduling method. The method first divides queries into two categories: those that retrieve data from a single table and those that retrieve data from multiple tables. For a single-table query, the method constructs a flow network according to the data distribution, uses the SRPushRelabelBinary algorithm to select replicas, and then selects the execution nodes. For a multi-table query, the method uses a cost model to search for a near-optimal scheduling strategy. The cost of query processing is defined as the interval between the start of query execution and the estimated time at which all join operators finish. It includes the execution time of the scan, select, exchange and join operators. The cost model accounts for communication cost, parallel execution and the load of the cluster. The MaxDiff(V, A) histogram is used to estimate intermediate result sizes and improve the accuracy of the cost model. The proposed method was implemented in Impala 2.0, and experiments on queries from the TPC-DS benchmark indicate that it can reduce response time by 10% to 30%.
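The single-table case can be illustrated with a small sketch. The abstract does not describe the internals of SRPushRelabelBinary, so the code below is not that algorithm; it is a minimal stand-in under stated assumptions. It assumes the goal is to read each block from exactly one of its replica hosts while minimising the largest number of blocks assigned to any host, models this as a flow network (source to block, block to replica host, host to sink with a capacity limit), and binary-searches the capacity limit. Simple augmenting paths are used in place of push-relabel, and the names `assign_with_cap` and `select_replicas` and the block and host labels are hypothetical.

```python
from typing import Dict, List, Optional, Set


def assign_with_cap(replicas: Dict[str, List[str]], cap: int) -> Optional[Dict[str, str]]:
    """Try to give every block one replica host, with at most `cap` blocks per host.

    Equivalent to a max-flow feasibility check on the network
    source -> block (capacity 1) -> replica host (capacity 1) -> sink (capacity `cap`).
    Returns a block -> host mapping, or None if no such assignment exists.
    """
    assigned: Dict[str, List[str]] = {}  # host -> blocks currently placed on it

    def place(block: str, seen: Set[str]) -> bool:
        for host in replicas[block]:
            if host in seen:
                continue
            seen.add(host)
            held = assigned.setdefault(host, [])
            if len(held) < cap:
                held.append(block)
                return True
            # Host is full: try to move one of its blocks to another host
            # (an augmenting path in the flow network).
            for i, other in enumerate(held):
                if place(other, seen):
                    held[i] = block
                    return True
        return False

    for block in replicas:
        if not place(block, set()):
            return None
    return {b: h for h, blocks in assigned.items() for b in blocks}


def select_replicas(replicas: Dict[str, List[str]]) -> Optional[Dict[str, str]]:
    """Binary-search the smallest per-host cap that still admits a full assignment."""
    if not replicas:
        return {}
    lo, hi = 1, len(replicas)
    best: Optional[Dict[str, str]] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        result = assign_with_cap(replicas, mid)
        if result is not None:
            best, hi = result, mid - 1
        else:
            lo = mid + 1
    return best


if __name__ == "__main__":
    # Hypothetical data distribution: block id -> hosts holding a replica.
    layout = {
        "b1": ["n1", "n2"],
        "b2": ["n1"],
        "b3": ["n1", "n3"],
        "b4": ["n2", "n3"],
    }
    print(select_replicas(layout))
```

A scheduler following the abstract would then choose execution nodes based on the selected replicas. This sketch balances block counts only; a fuller model would also weight edges by network transmission cost and current node load, which the abstract names as the factors Impala ignores.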

Keywords/Search Tags:

replica selection, parallel processing, schedule, cost model, big data real-time query

Related items

1	A Query Optimization Of Embedded Mobile Real-time DBMS Based On Cost Model
2	Study On Replica Selection Strategy In Data Grid
3	Research On Replica Selection Strategy In Data Grids
4	Parallel Query Processing System On Large-scale RDF Data
5	Research Of Replica Management In Data Grid
6	Parallel Query Processing Techniques In Parallel Database System PBASE/2
7	Research And Implementation Of Big Data Real-Time Query Optimization Based On Hypergraph And Bushy-Tree
8	Replica Location Service And Selection Service In Data Grids
9	Research Of Task Partition And Cost Model In Xquery Parallel Implementation
10	Data Flow In Adaptive Query Processing Mechanism