
The Research Of In-memory Data Caching Technology In Map/Reduce-styled Massive Data Processing Platform

Posted on: 2014-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: G R Li
Full Text: PDF
GTID: 2268330392473504
Subject: Computer Science and Technology

Abstract/Summary:
Map/Reduce-styled data processing platforms are the cutting-edge technology in the massive data processing field. Unlike traditional data processing platforms, a Map/Reduce-styled platform distributes data across the computing nodes and schedules tasks with data locality, which gives it predominant scalability.

Good data access performance contributes much to the data processing efficiency of a Map/Reduce platform. Existing Map/Reduce platforms store data on a disk-based distributed file system, which leads to poor data access efficiency. In-memory data caching is a typical technique for improving data access efficiency and has proven effective in data center storage systems, but it remains a blank space in Map/Reduce platforms.

This paper focuses on in-memory data caching technology for the open Map/Reduce platform, where data reuse often occurs both within and across applications. Considering the new feature of scheduling tasks with data locality, we propose that the performance goal of Map/Reduce-oriented data caching should shift from cache hit ratio to the execution efficiency of parallel Map/Reduce jobs. The main contributions of this paper are as follows:

1) Considering the distribution of data across computing nodes and the need to guarantee cache integrity for a computing task, a shared-memory-based distributed cooperative data cache organization model is presented, with the file split defined as the data caching granularity.

2) A data cache replacement strategy is proposed to pursue a high ratio of data processing localization. Focusing on the new feature that computing nodes and storage nodes overlap, two decision factors are introduced into cache replacement: the utilization of computing slot resources and the local-access ratio of each file split. Experimental results show that the proposed replacement strategy reduces the average turnaround time of Map/Reduce jobs by up to 19.4%.

3) Addressing the growing amount of one-off-access data in Map/Reduce platforms, a data cache prefetching strategy is designed. It chooses computing nodes whose slot resources will soon be released as the prefetch destinations and notifies the task scheduler to deploy the tasks that need the prefetched data on those nodes, thereby achieving data processing localization. Experimental results show that the prefetching strategy decreases the average execution time of tasks processing one-off-access data by up to 53.3%.

4) Targeting task scheduling efficiency, a data-cache-aware task scheduling strategy based on FCFS (First Come First Served) scheduling is presented. It integrates data cache scheduling seamlessly with other resource scheduling and schedules tasks with both cached-data locality and disk-data locality.

5) An in-memory data caching prototype system for the Map/Reduce platform, called Dacoop, is proposed. Dacoop is built on Hadoop, an open-source Map/Reduce platform. Experimental results show that Dacoop outperforms Hadoop on average job turnaround time by up to 54.4%, and by 47.9% on average.
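The two-factor replacement decision in contribution 2 can be illustrated with a small sketch. The abstract does not give the exact formula, so the weighted-sum scoring, the field names, and the weights below are illustrative assumptions: a split is worth keeping when its host node's compute slots are busy and the split is mostly accessed node-locally, and the split with the lowest score is evicted.

```python
# Hypothetical sketch of the two-factor cache replacement decision.
# The weighted-sum combination is an assumption, not the thesis's formula.
from dataclasses import dataclass

@dataclass
class CachedSplit:
    split_id: str
    slot_utilization: float    # busy fraction of compute slots on the host node (0..1)
    local_access_ratio: float  # fraction of accesses to this split served locally (0..1)

def eviction_victim(splits, w_slots=0.5, w_local=0.5):
    """Return the cached split with the lowest retention score.

    High slot utilization and a high local-access ratio both argue for
    keeping a split, so the victim minimizes their weighted sum.
    """
    return min(splits, key=lambda s: w_slots * s.slot_utilization
                                     + w_local * s.local_access_ratio)
```

Under this scoring, a split that is rarely accessed locally and sits on an idle node is evicted first, which matches the localization goal stated above.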
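The placement rule of the prefetching strategy in contribution 3 can likewise be sketched. Assuming each running task reports an expected finish time (an assumption for illustration; the thesis's actual estimator is not given in the abstract), the prefetch destination is simply the node whose slot frees up soonest, so the data arrives just as a slot becomes available and the follow-up task can run node-locally.

```python
# Hypothetical sketch of prefetch-destination selection: pick the node
# whose busy slot is expected to be released soonest.
def prefetch_target(running_tasks):
    """running_tasks: iterable of (expected_finish_time, node_id) pairs,
    one per occupied slot. Returns the node_id with the earliest
    expected slot release."""
    finish_time, node = min(running_tasks)
    return node
```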
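The cache-aware FCFS strategy in contribution 4 can be read as a locality-tiered scan of the arrival queue: when a slot frees on a node, prefer the earliest-arrived task whose input split is cached in that node's memory, then one whose split is on the node's disk, and only then fall back to the queue head with a remote read. This tiering is a minimal sketch of that idea; the data structures and the exact tie-breaking are assumptions.

```python
# Hypothetical sketch of cache-aware FCFS task selection.
def pick_task(fcfs_queue, free_node, cached_splits, disk_splits):
    """Scan the FCFS queue in arrival order at two locality levels:
    first memory-cached splits on free_node, then disk-local splits;
    fall back to the queue head (remote read) if neither matches.

    fcfs_queue: list of (task_id, split_id) in arrival order.
    cached_splits / disk_splits: dict mapping node_id -> set of split_ids.
    """
    for level in (cached_splits.get(free_node, set()),
                  disk_splits.get(free_node, set())):
        for task in fcfs_queue:
            if task[1] in level:
                return task
    return fcfs_queue[0] if fcfs_queue else None
```

Because the scan is restarted per locality level rather than per task, cache locality strictly dominates disk locality, while FCFS order breaks ties within each level.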
Keywords/Search Tags: Massive Data Processing, Map/Reduce-Styled Platform, In-memory Data Caching