Research And Implementation Of Data Placement And Query Techniques Based On MapReduce In Distributed Multi-Dimensional Data Warehouse

Posted on: 2014-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ma
Full Text: PDF
GTID: 2308330473453860
Subject: Computer software and theory
Abstract/Summary:
As Internet applications and computer technology extend into all aspects of human social life, the volume of data is growing explosively. Today, the storage and processing of big data, including very-large-scale data, have become new challenges for enterprises, and many methods have been proposed to meet the requirements of big data processing. Among these methods, Cloud Computing stands out. The main idea of cloud computing is to perform large-scale data analysis tasks on large clusters of low-end hardware instead of on expensive high-end servers. As Cloud Computing technology develops, more and more applications will be moved to the Cloud, including the Database Management System (DBMS). However, enforcing ACID (Atomicity, Consistency, Isolation, Durability), the four required characteristics of a DBMS, may lead to poor performance, especially for join operations, when the data is stored in a distributed system. To solve these problems, the thesis designs a distributed multi-dimensional data warehouse system (DMDWH) by integrating MapReduce with the relational database. DMDWH is composed of five components: client, metadata database, query engine, data loader, and extended Hadoop cluster. For data placement, we propose three efficient storage strategies: the full table copy strategy, the independent horizontal partition strategy, and the joint horizontal partition strategy. Related data is assigned to the same node as much as possible, which greatly increases the opportunity to perform join operations on the local host. In this way, the system avoids cross-node network communication overhead and data transfer costs. For query optimization, we add a compiler, optimizer, generator, and executor to form a query engine. In addition, we optimize the generator based on cost calculation so that it can generate an optimal execution plan.
Finally, we extend the InputFormat and OutputFormat interfaces in Hadoop, which makes data input/output from the DBMS parallel in a real sense. The extended distributed multi-dimensional data warehouse (DMDWH) takes full advantage of both the RDBMS and the MapReduce computing architecture, combining the indexing and query optimization techniques of the RDBMS with the parallelism, ease of use, and scalability of MapReduce. Experiments show that the system architecture performs well in loading, querying, and fault tolerance, and that it provides faster and more efficient parallel queries for data warehouse applications.
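The idea behind the joint horizontal partition strategy (placing related rows on the same node so joins can run locally) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the hash-based placement rule, the function names (`node_for`, `co_partition`, `local_joins`), and the sample `orders` and `customers` tables are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Map a join-key value to a node index with a stable hash (assumed rule)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def co_partition(rows, key_field, num_nodes):
    """Horizontally partition rows by the hash of their join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[node_for(row[key_field], num_nodes)].append(row)
    return parts


def local_joins(left_parts, right_parts, key_field, num_nodes):
    """Equi-join each node's partitions independently; no row leaves its node."""
    joined = []
    for node in range(num_nodes):
        # Build a per-node hash index on the right-hand table.
        index = defaultdict(list)
        for right_row in right_parts.get(node, []):
            index[right_row[key_field]].append(right_row)
        # Probe the index with the left-hand rows stored on the same node.
        for left_row in left_parts.get(node, []):
            for right_row in index.get(left_row[key_field], []):
                joined.append({**left_row, **right_row})
    return joined


# Hypothetical sample data: a fact table and a dimension table.
customers = [
    {"cust_id": 1, "name": "A"},
    {"cust_id": 2, "name": "B"},
    {"cust_id": 3, "name": "C"},
]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 1},
    {"order_id": 13, "cust_id": 4},  # no matching customer
]

NUM_NODES = 3
order_parts = co_partition(orders, "cust_id", NUM_NODES)
customer_parts = co_partition(customers, "cust_id", NUM_NODES)
result = local_joins(order_parts, customer_parts, "cust_id", NUM_NODES)
```

Because both tables are partitioned with the same function on the same key, every matching pair of rows lands on the same node, so the union of the per-node joins equals the global join without moving any rows between nodes.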
Keywords/Search Tags: MapReduce, DBMS, Distributed Data Warehouse, Independent Horizontal Partition, Joint Horizontal Partition