Techniques Of Partition And Query In Data Warehouses Based On Hadoop

Posted on:2013-08-05

Degree:Master

Type:Thesis

Country:China

Candidate:J C Qiao

Full Text:PDF

GTID:2268330425997167

Subject:Computer application technology

Abstract/Summary:

Data collection methods have advanced greatly in recent years, and many enterprises now hold huge amounts of data. How to store and retrieve this data has become increasingly important, because the information hidden in it can guide an enterprise's next strategy. With the arrival of the MapReduce computation model, integrating MapReduce with relational data warehouse technologies has become an effective way to handle huge data volumes. However, current distributed data warehouses built on MapReduce do not take the correlation of data into account. If a data warehouse stores data according to its correlation, retrieval performance on that warehouse will improve.

To address this issue, this thesis makes three contributions. First, it designs a multi-dimensional derived horizontal partitioning scheme for the fact table of a data warehouse. The scheme partitions the tuples of the fact table by specified values of the referenced dimensions, so the tuples within each partition are correlated. This scheme reduces the number of Map tasks on the cluster: only the nodes that store partitions relevant to an SQL query need to start Map tasks.

Second, the thesis designs and implements MDChunkDB, a multi-dimensional distributed data warehouse that can be deployed on a large cluster of inexpensive PCs. By integrating MapReduce with traditional relational database technologies, it combines the advantages of both: relational databases are responsible for storage, and MapReduce is responsible for parallel search operations. The thesis describes each part of MDChunkDB in detail, including its metadata, storage strategy, fault tolerance, and scalability.

Finally, the thesis extends the InputFormat data interface of the Hadoop framework to integrate MDChunkDB with MapReduce, so that Hadoop query and retrieval tasks can run on MDChunkDB.

Experiments test the loading and retrieval performance of MDChunkDB. Its loading performance is worse than that of HadoopDB, but its retrieval performance is better than that of HadoopDB on large datasets. Moreover, MDChunkDB supports multi-table join operations on a star schema. Because MDChunkDB does not outperform HadoopDB in every respect, it needs further improvement, for example in its partitioning scheme.
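The derived horizontal partitioning idea described in the abstract can be illustrated with a minimal Python sketch. It is not the MDChunkDB implementation: the tables, the column names, the choice of (region, year) as partitioning values, and the helper functions `derived_partition` and `prune` are all assumptions made for illustration. The sketch groups fact tuples by values of the dimension rows they reference, then selects only the partitions a query can match, which stands in for starting Map tasks only on the relevant nodes.

```python
from collections import defaultdict

# Hypothetical star schema (all names and values are illustrative assumptions).
# Dimension tables map a key to the attribute used for partitioning.
DIM_STORE = {1: "north", 2: "south", 3: "north", 4: "west"}  # store_id -> region
DIM_DATE = {10: 2011, 11: 2012, 12: 2012}                    # date_id -> year

# Fact table rows: (store_id, date_id, amount)
FACT_SALES = [
    (1, 10, 5.0),
    (2, 11, 7.5),
    (3, 11, 2.0),
    (4, 12, 9.0),
    (1, 12, 4.0),
]


def derived_partition(facts):
    """Group fact tuples by the (region, year) of the dimension rows they reference."""
    partitions = defaultdict(list)
    for store_id, date_id, amount in facts:
        key = (DIM_STORE[store_id], DIM_DATE[date_id])
        partitions[key].append((store_id, date_id, amount))
    return dict(partitions)


def prune(partitions, region=None, year=None):
    """Return the keys of partitions that can contain rows matching the predicate.

    In a MapReduce setting, only these partitions would need a Map task.
    """
    return sorted(
        key
        for key in partitions
        if (region is None or key[0] == region)
        and (year is None or key[1] == year)
    )


partitions = derived_partition(FACT_SALES)

# Query: total sales for region "north" in 2012.
selected = prune(partitions, region="north", year=2012)
total = sum(amount for key in selected for _, _, amount in partitions[key])
```

In this toy dataset the query touches one partition out of four, so the remaining three would not be scanned at all. The saving in a real deployment depends on how well the chosen dimension values match the query workload.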

Keywords/Search Tags:

derived partition, MapReduce, Hadoop, parallel search, distributed data warehouse

Related items

1	Research On Optimization Technology Of Data Parallel Processing Based On MapReduce
2	Research Of MapReduce Data Skew And Task Scheduling In Heterogeneous Environments
3	Research On Parallel Mining Algorithm Of Space Co-Location Based On Hadoop
4	Study On The Big-data-based Social Media Field Monitoring Technology
5	Research On IPTV QOS Log Analysis Method
6	Research On MapReduce Performance Optimization Based On Hadoop
7	Research On The Clustering Algorithm Of Parallel Partition Based On MapReduce
8	Research And Implementation Of Parallel Clustering Algorithm Based On Approximate Spectrum Hadoop MapReduce
9	Research On Parallel Data Mining Algorithms Based On Hadoop
10	Research On Big Data Processing System Based On MapReduce Parallel Processing Framework