
Techniques Of Partition And Query In Data Warehouses Based On Hadoop

Posted on: 2013-08-05
Degree: Master
Type: Thesis
Country: China
Candidate: J C Qiao
Full Text: PDF
GTID: 2268330425997167
Subject: Computer application technology
Abstract/Summary:
Data collection capabilities have grown enormously in recent years, and many enterprises now hold very large volumes of data. Storing and retrieving this data efficiently has become increasingly important, because the information hidden in it can guide an enterprise's next strategic decisions. With the arrival of the MapReduce computation model, integrating MapReduce with relational data warehouse technology has become an effective way to handle large data volumes. However, current distributed data warehouses built on MapReduce do not take the correlation between data items into account. If a data warehouse stores correlated data together, its retrieval performance can be improved.

To address this issue, this thesis makes three contributions. First, it designs a multi-dimensional derived horizontal partitioning scheme for the fact table of a data warehouse. The scheme partitions the tuples of the fact table according to specified values of the referenced dimensions, so that the tuples within each partition are correlated. This reduces the number of Map tasks on the cluster: only the nodes that store partitions relevant to a SQL query need to start Map tasks. Second, it designs and implements MDChunkDB, a multi-dimensional distributed data warehouse that can be deployed on large clusters of inexpensive PCs. MDChunkDB combines the advantages of MapReduce and traditional relational database technology: relational databases are responsible for storage, and MapReduce is responsible for parallel search. The thesis describes each part of MDChunkDB in detail, including metadata, storage strategy, fault tolerance, and scalability. Third, it extends the InputFormat data interface of the Hadoop framework to integrate MDChunkDB with MapReduce, so that Hadoop query and retrieval tasks can run on MDChunkDB.

Finally, experiments evaluate the loading and retrieval performance of MDChunkDB. Loading performance is worse than that of HadoopDB, but retrieval performance is better than that of HadoopDB on large datasets. Moreover, MDChunkDB supports multi-table join operations on the star schema. Because MDChunkDB does not outperform HadoopDB in every respect, it still needs improvement, for example in its partitioning scheme.
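
The derived horizontal partitioning of the fact table described above can be illustrated with a small sketch. This is not MDChunkDB's implementation: the tables, the column names (store_id, date_id, amount), and the function names are hypothetical, and everything runs in memory in a single process. It only shows the idea that fact tuples are grouped by attribute values of the dimensions they reference, and that a query with predicates on those dimensions needs to scan only the matching partitions.

```python
from collections import defaultdict

# Hypothetical star-schema data (not taken from the thesis).
DIM_REGION = {1: "north", 2: "south", 3: "north"}  # store_id -> region
DIM_YEAR = {10: 2011, 11: 2012}                    # date_id -> year

# Fact tuples: (store_id, date_id, amount)
FACTS = [
    (1, 10, 5.0),
    (2, 10, 7.5),
    (3, 11, 2.0),
    (1, 11, 4.0),
    (2, 11, 1.5),
]


def partition_key(fact):
    """Derive the partition key from the dimensions the fact references."""
    store_id, date_id, _amount = fact
    return (DIM_REGION[store_id], DIM_YEAR[date_id])


def build_partitions(facts):
    """Group fact tuples so each partition holds tuples with equal dimension values."""
    partitions = defaultdict(list)
    for fact in facts:
        partitions[partition_key(fact)].append(fact)
    return dict(partitions)


def relevant_partitions(partitions, region=None, year=None):
    """Return the keys of partitions that can contain rows matching the predicates."""
    return sorted(
        key
        for key in partitions
        if (region is None or key[0] == region)
        and (year is None or key[1] == year)
    )


def run_query(partitions, region=None, year=None):
    """Sum the amount over matching partitions only; the others are never scanned."""
    keys = relevant_partitions(partitions, region, year)
    total = sum(amount for key in keys for (_s, _d, amount) in partitions[key])
    return keys, total
```

In a cluster setting, each partition key would map to the node that stores it, so the list returned by relevant_partitions corresponds to the nodes that would need to start Map tasks for the query.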
Keywords/Search Tags: derived partition, MapReduce, Hadoop, parallel search, distributed data warehouse