
Hadoop-CC (Collaborative Caching) in Real Time HDFS Thesis

Posted on: 2013-11-17
Degree: M.S
Type: Thesis
University: Rochester Institute of Technology
Candidate: Shrivastava, Meenakshi
Full Text: PDF
GTID: 2458390008472145
Subject: Information Technology
Abstract/Summary:
Data is being generated at an enormous rate by online activity and the growing use of computing resources. Distributed systems are an efficient mechanism for accessing and handling such large, widely spread data. One widely used distributed filesystem is the Hadoop Distributed File System (HDFS). HDFS takes a cluster approach to storing huge amounts of data; it is scalable and runs on low-cost commodity hardware. Hadoop uses the MapReduce framework to analyze these large data sets and carry out computations on them in parallel. Hadoop follows a master/slave architecture that decouples system metadata from application data: metadata is stored on a dedicated server, the NameNode, and application data on DataNodes.

In this thesis, the Hadoop architecture and the behaviour of the filesystem and of MapReduce were studied in detail. The study concluded that MapReduce processing is slow, which was confirmed by initial analysis and experiments on the default Hadoop configuration. Accessing data from cache is known to be much faster than accessing it from disk. Collaborative caching is a mechanism in which caches distributed over clients, dedicated servers, or storage devices form a single cache to serve requests. This mechanism improves performance, reduces access latency, and increases throughput. Coupling it with prefetching enhances performance further.

To improve the performance of MapReduce, the thesis proposes a new design of HDFS that introduces caching references and collaborative caching with prefetching, coupled with Modified-ARC cache replacement. Each DataNode would have a dedicated Cache Manager that maintains information about its local cache and remote caches and applies the cache replacement algorithm. Initial analysis led to the conclusion that caching references also help improve performance. Modified-ARC organizes the cache into recent items, frequent items, and a history of evicted items; it is a better cache replacement policy and improves the execution time and performance of MapReduce. The results were evaluated by comparing them with those of the default configuration in pseudo-distributed and fully distributed modes.
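
The lookup order implied by the abstract (local cache, then remote caches, then disk) can be illustrated with a minimal Python sketch. It is not the thesis implementation and not the full ARC or Modified-ARC algorithm: the class `CacheManager`, its methods, and the eviction rule (evict from the recent list first) are all hypothetical simplifications, and plain dictionaries stand in for DataNode disks. It only shows how a cache split into recent items, frequent items, and a history of evicted keys might cooperate with peer caches.

```python
from collections import OrderedDict


class CacheManager:
    """Hypothetical per-DataNode cache manager (illustrative names only)."""

    def __init__(self, name, capacity, disk):
        self.name = name
        self.capacity = capacity       # max blocks in recent + frequent (>= 1)
        self.disk = disk               # dict standing in for on-disk blocks
        self.recent = OrderedDict()    # blocks accessed once, oldest first
        self.frequent = OrderedDict()  # blocks accessed again, oldest first
        self.history = OrderedDict()   # ids of evicted blocks, no data kept
        self.peers = []                # CacheManagers on other DataNodes

    def _local_get(self, block_id):
        if block_id in self.recent:
            # A second access promotes the block from recent to frequent.
            data = self.recent.pop(block_id)
            self.frequent[block_id] = data
            return data
        if block_id in self.frequent:
            self.frequent.move_to_end(block_id)
            return self.frequent[block_id]
        return None

    def _evict_one(self):
        # Simplification: evict from recent first, then from frequent.
        victims = self.recent if self.recent else self.frequent
        block_id, _ = victims.popitem(last=False)
        self.history[block_id] = True
        if len(self.history) > self.capacity:
            self.history.popitem(last=False)

    def _insert(self, block_id, data):
        if len(self.recent) + len(self.frequent) >= self.capacity:
            self._evict_one()
        if block_id in self.history:
            # A block that returns after eviction is treated as frequent.
            del self.history[block_id]
            self.frequent[block_id] = data
        else:
            self.recent[block_id] = data

    def read(self, block_id):
        """Return (data, source): local cache, then peer caches, then disk."""
        data = self._local_get(block_id)
        if data is not None:
            return data, "local-cache"
        for peer in self.peers:
            data = peer._local_get(block_id)
            if data is not None:
                self._insert(block_id, data)
                return data, "remote-cache:" + peer.name
        for node in [self] + self.peers:
            if block_id in node.disk:
                data = node.disk[block_id]
                self._insert(block_id, data)
                return data, "disk:" + node.name
        raise KeyError(block_id)


if __name__ == "__main__":
    dn1 = CacheManager("dn1", capacity=2, disk={"b1": b"alpha"})
    dn2 = CacheManager("dn2", capacity=2, disk={"b2": b"beta"})
    dn1.peers, dn2.peers = [dn2], [dn1]
    print(dn1.read("b1")[1])  # disk:dn1
    print(dn1.read("b1")[1])  # local-cache
    print(dn2.read("b1")[1])  # remote-cache:dn1
```

In this sketch a read served from a peer's cache avoids a disk access entirely, which is the effect collaborative caching aims for; real ARC additionally adapts the balance between the recent and frequent lists, which is omitted here.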
Keywords/Search Tags:HDFS, Collaborative caching, Hadoop, Data, Distributed, Cache