
Reactive Scheduling For Online Analytical Processing Over Hadoop/HBase Cluster

Posted on: 2015-06-03
Degree: Master
Type: Thesis
Country: China
Candidate: Dieudonn Muheto
Full Text: PDF
GTID: 2298330434454000
Subject: Computer Science and Technology
Abstract: Data warehouses and OLAP (OnLine Analytical Processing) systems allow quick access to, and synthesis of, large volumes of data for analysis. In this sense, data warehouses are essential tools for BI (business intelligence). Hadoop/HBase clusters in particular provide substantial resources for processing and storage. Deploying a warehouse on a Hadoop/HBase cluster infrastructure, however, requires adapting the multidimensional model and the OLAP process to reflect the distribution of data and aggregates.
The multidimensional model exploits the horizontal fragmentation of aggregate tables and the column-oriented nature of the HBase database. Determining which data are available is not trivial, especially in this context of distributed data. This thesis therefore introduces a model for identifying warehouse data and a method for indexing data as multidimensional blocks. Searching for available data in a distributed warehouse requires an unambiguous identification of the data. The identification model builds on the way OLAP queries reference data in the warehouse. The data are subdivided into chunks and, because of the large amount of data, the notion of chunks blocks is defined.
An index structure supporting these two models is based on an index of blocks of chunks and on indexes in cuboid lattices; it allows data materialized on different cluster nodes to be located. The first index, the Cuboids Index, is structured as a lattice and distinguishes aggregates at different levels. The second index, the Chunks Block Index, indexes the content of the chunks blocks. Finally, the CCB index is the combination of these two indexes. The proposed reactive scheduling strategy handles maintenance operations in batch mode. It also deals with online query processing and builds an optimized execution plan from the list of candidate blocks that contribute to the query result.
Query processing has to rewrite the client query and locate the data useful for it. An OLAP query, initially expressed in an SQL-like form, is translated into chunks block identifiers so that the requested data can be located by querying the CCB index. It is during this location step that the data contributing to the result of a query, and the corresponding chunks blocks, are searched for on the cluster.
This thesis presents a prototype with services designed to manage the data of a distributed warehouse on HBase. The prototype allows us to test the feasibility and performance of our research. This aim has been achieved by implementing a Data Warehouse/ETL/OLAP system that uses HDFS/HBase as the DW/OLAP backing store and MapReduce as the ETL process. Both kinds of workloads are managed by the reactive scheduling engine. In the tests, batch processing comparisons and real-time processing comparisons gave very satisfactory results and demonstrate the feasibility of our approach. Nevertheless, to obtain better performance, improvements may be necessary in each of the elements presented in this thesis.
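As a rough illustration of the lookup described above, the following Python sketch translates the member ranges of a query into chunks block identifiers and resolves them against a combined index. It is not the thesis's implementation: the chunk and block sizes, the identifier layout (cuboid plus block coordinates), and the names `block_ids`, `CCBIndex`, `register`, and `locate` are all assumptions made for this example, and the lattice ordering of the Cuboids Index is reduced to a plain dictionary keyed by cuboid.

```python
from itertools import product

# All names and sizes below are hypothetical, chosen only for illustration.
CHUNK_SIZE = 10   # members per chunk along each dimension
BLOCK_SIZE = 4    # chunks per chunks block along each dimension


def block_ids(cuboid, ranges):
    """Translate per-dimension member ranges into chunks block identifiers.

    cuboid: tuple of dimension-level names, e.g. ("month", "city")
    ranges: one inclusive (low, high) member-ordinal pair per dimension
    """
    span = CHUNK_SIZE * BLOCK_SIZE  # members covered by one block per dimension
    per_dim = [range(low // span, high // span + 1) for low, high in ranges]
    return [(cuboid, coords) for coords in product(*per_dim)]


class CCBIndex:
    """Toy combined index: cuboid -> {block coordinates -> cluster node}."""

    def __init__(self):
        self._cuboids = {}

    def register(self, cuboid, coords, node):
        """Record that the block at `coords` of `cuboid` lives on `node`."""
        self._cuboids.setdefault(cuboid, {})[coords] = node

    def locate(self, cuboid, ranges):
        """Return (found, missing): block id -> node, and unmaterialized ids."""
        blocks = self._cuboids.get(cuboid, {})
        found, missing = {}, []
        for block_id in block_ids(cuboid, ranges):
            _, coords = block_id
            if coords in blocks:
                found[block_id] = blocks[coords]
            else:
                missing.append(block_id)
        return found, missing


index = CCBIndex()
index.register(("month", "city"), (0, 0), "node-1")
index.register(("month", "city"), (0, 1), "node-2")

# Query over months 0..11 and cities 30..90 at the (month, city) cuboid.
found, missing = index.locate(("month", "city"), [(0, 11), (30, 90)])
```

In this sketch, `found` maps each candidate block to the node holding it, which is the kind of list an execution plan could be built from, while `missing` lists blocks that would have to be computed from finer-grained data.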
Keywords/Search Tags:Business Intelligence, Data warehouse, Hadoop/HBase Cluster, Online Analytical Processing, Reactive Scheduling