
Performance evaluation of big data placement structures in MapReduce-based data warehouse systems

Posted on: 2017-02-12
Degree: M.S.
Type: Thesis
University: Lamar University - Beaumont
Candidate: Hasan, Mohammad Rakibul
Full Text: PDF
GTID: 2468390014473096
Subject: Computer Science
Abstract/Summary:
Data sets are growing rapidly in size, which calls for fundamentally new techniques and technologies to capture, store, distribute, and process data promptly and cost-effectively. The Hadoop software framework, with its MapReduce execution engine, can process large data sets across clusters and provides scalability and fault tolerance on distributed systems. A MapReduce-based warehouse system with a suitable data storage format is well suited to data summarization and query analysis. Such a warehouse can contain millions of row and column values, so the data placement structure plays a significant role in warehouse performance. In this research, we examine the performance of two of Hive's data file formats, RCFile and ORCFile, on top of Hadoop. For the experiment, we design and implement a three-node distributed cluster with a master-slave architecture, in which we store and organize the data according to each file format's structure. We investigate the efficiency of the file formats in terms of data loading, data storage, and query processing using MapReduce. The experimental results can guide the choice of the most suitable file format for a data warehouse system used for Big Data processing.
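As a rough illustration of why the data placement structure matters, the sketch below contrasts plain row-oriented placement with a row-group columnar placement, the general idea behind formats such as RCFile (rows are first split into groups, and each group is then stored column by column). This is a minimal in-memory sketch under stated assumptions, not the actual on-disk layout of RCFile or ORCFile and not the thesis's experimental code; the function names, the sample rows, and the group size are all hypothetical.

```python
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]


def row_major(rows: Sequence[Row]) -> List[Any]:
    """Row-oriented placement: all values of one row are stored adjacently."""
    return [value for row in rows for value in row]


def row_group_columnar(rows: Sequence[Row], group_size: int) -> List[List[List[Any]]]:
    """Split rows into groups, then store each group column by column.

    Illustrative only: real formats add headers, compression, and metadata.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk from rows into columns.
        groups.append([list(column) for column in zip(*chunk)])
    return groups


def scan_column(groups: List[List[List[Any]]], index: int) -> List[Any]:
    """Read one column by touching only that column's block in every group."""
    return [value for group in groups for value in group[index]]


# Hypothetical sample table with three columns: id, code, amount.
rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "c", 3.5), (4, "d", 4.5), (5, "e", 5.5)]
groups = row_group_columnar(rows, group_size=2)
ids = scan_column(groups, 0)
```

In the row-major list, a query that needs only the `id` column still has to step over every other value, whereas `scan_column` reads one contiguous block per group. That difference in access pattern is the kind of effect the loading, storage, and query measurements in the thesis are meant to quantify.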
Keywords/Search Tags:Warehouse system, Big data, Format, Data placement, Data sets