Research And Implementation Of Data Placement And Query Techniques Based On MapReduce In Distributed Multi-Dimensional Data Warehouse

Posted on:2014-04-03

Degree:Master

Type:Thesis

Country:China

Candidate:Y Ma

Full Text:PDF

GTID:2308330473453860

Subject:Computer software and theory

Abstract/Summary:

With Internet applications and computer technology extending into every aspect of social life, data volumes are growing explosively. Storing and processing big data, and very-large-scale data in particular, has become a new challenge for enterprises, and many methods have been proposed to meet it. Among them, cloud computing stands out. Its main idea is to run large-scale data analysis tasks on large clusters of low-end hardware instead of expensive high-end servers. As cloud computing technology develops, more and more applications will move to the cloud, including database management systems (DBMS). However, guaranteeing ACID (Atomicity, Consistency, Isolation, Durability), the four required properties of a DBMS, may lead to poor performance when the data is stored in a distributed system, especially for join operations.

To solve these problems, this thesis designs a distributed multi-dimensional data warehouse system (DMDWH) that integrates MapReduce with relational databases. DMDWH is composed of five components: client, metadata database, query engine, data loader, and extended Hadoop cluster. For data placement, we propose three efficient storage strategies: the full table copy strategy, the independent horizontal partition strategy, and the joint horizontal partition strategy. Related data is assigned to the same node as far as possible, which greatly increases the opportunity to perform joins on the local host and thus avoids cross-node network communication overhead and data transmission costs. For query optimization, the query engine consists of a compiler, an optimizer, a generator, and an executor; the generator is optimized with cost calculation so that it can produce an optimal execution plan. Finally, we extend Hadoop's InputFormat and OutputFormat interfaces, which enables truly parallel data input and output between the DBMS and MapReduce.

The extended DMDWH takes full advantage of both the RDBMS and the MapReduce computing architecture, combining the index-based query optimization techniques of the RDBMS with the parallelism, ease of use, and scalability of MapReduce. Experiments show that the system architecture performs well in loading, querying, and fault tolerance, and that it can provide faster and more efficient parallel queries for data warehouse applications.
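
The abstract names the three placement strategies but does not define them, so the following Python sketch shows only one plausible reading: a full copy replicates a small table to every node, an independent horizontal partition hashes a table on one of its own columns, and a joint horizontal partition hashes two tables on their shared join column so that matching rows land on the same node. All function names, the `crc32`-based placement rule, and the sample tables are assumptions made for illustration, not the thesis's implementation.

```python
from collections import defaultdict
from zlib import crc32


def node_for(key, num_nodes):
    """Map a key to a node id. crc32 is used because it is stable across runs
    (Python's built-in hash() is salted for strings). Assumed placement rule."""
    return crc32(str(key).encode("utf-8")) % num_nodes


def full_copy(rows, num_nodes):
    """Full copy of a table: every node receives all rows."""
    return {node: list(rows) for node in range(num_nodes)}


def horizontal_partition(rows, key, num_nodes):
    """Hash-partition rows on `key`. Used alone, this is an independent
    horizontal partition; applied to two tables with the same join key,
    it acts as a joint horizontal partition (matching rows are co-located)."""
    placement = {node: [] for node in range(num_nodes)}
    for row in rows:
        placement[node_for(row[key], num_nodes)].append(row)
    return placement


def local_join(left, right, key):
    """Hash join of two row lists that live on the same node."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical sample data: a customers dimension and an orders fact table.
NUM_NODES = 4
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(10)]
orders = [{"order_id": j, "cust_id": j % 10} for j in range(50)]

# Joint horizontal partition: both tables are hashed on cust_id.
customers_by_node = horizontal_partition(customers, "cust_id", NUM_NODES)
orders_by_node = horizontal_partition(orders, "cust_id", NUM_NODES)

# Each node joins only its own fragments; no rows cross the network.
joined = [
    row
    for node in range(NUM_NODES)
    for row in local_join(orders_by_node[node], customers_by_node[node], "cust_id")
]
```

Because both tables use the same hash function on the same column, the union of the per-node joins equals the join over the full tables. With a full copy of `customers`, the same local join would work regardless of how `orders` is partitioned, at the cost of storing the dimension table on every node.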

Keywords/Search Tags:

MapReduce, DBMS, Distributed Data Warehouse, Independent Horizontal Partition, Joint Horizontal Partition

Related items

1	Auto-sharding Technique And Algorithm For Distributed Relation Database Based On SQL History
2	Research And Application Of The Partition Technology In Real-time Data Warehouse
3	Research On Data Partition Optimization Method Of Shared-Nothing Relational In-Memory Database
4	Research On Virtual Partition Strategies Of A Shared Storage Distributed Database
5	Techniques Of Partition And Query In Data Warehouses Based On Hadoop
6	Research On Partition Selection Strategy For Big Data Management Based On KNN Connection Processing
7	Based On Python, Mysql Database Application Layer Level Partitioning Technology
8	Hbase Based Credible Dataware Construction Of Business Quarterly And OLAP Query Analysis
9	Research On Switching Control Method For The Gymnastic Robot On Horizontal Bar