Research On Shuffle Technology Of Separation Of Computing And Storage In Big Data System

Posted on: 2021-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Hu
Full Text: PDF
GTID: 2518306104488224
Subject: Computer software and theory
Abstract/Summary:
Shuffle is the bridge between the mapper side and the reducer side, so the reliability and performance of the shuffle service directly affect an application's execution efficiency. The existing shuffle mechanism aggregates data in memory, which is prone to data spills and causes write amplification. When reduce tasks pull data, they generate a large number of small, random I/O requests, so I/O queue waiting time and disk seek time account for a large part of the total disk service time. D-Shuffle is an efficient shuffle service that separates computing from storage to solve these problems. It sends the data computed by multiple mappers to a distributed shuffle service process dedicated to shuffling. The shuffle service uses a hybrid memory layout of Dynamic Random Access Memory (DRAM) and Non-Volatile Memory (NVM): keys are placed in DRAM and values in NVM. The data sent from multiple mappers is merged, sorted as needed, and finally written to a distributed file system. This process reduces data spills on the computing nodes and lets the reducer side fetch the data produced by multiple mappers with fewer seeks. At the same time, the distributed file system ensures the reliability of the shuffle data. Because shuffle data may still be lost under extreme conditions, D-Shuffle includes an interruptible re-compute mechanism that reduces the overhead of recalculation. D-Shuffle is implemented on Spark. Experimental results show that D-Shuffle performs significantly better than Spark's existing shuffle mechanism: it avoids write amplification in the map phase, reduces recalculation overhead by 37% on average, and improves end-to-end job performance by 23%-33%.
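The merge step described in the abstract can be pictured with a small toy model. This is only an illustrative sketch, not the thesis's implementation: the class `ShuffleServiceSketch`, its methods `push` and `flush`, and the sample records are all hypothetical, a `bytearray` stands in for the NVM value region, and an ordinary dictionary stands in for the DRAM key index. It shows records from several mappers being merged per reduce partition and sorted by key, so that each reducer can read one contiguous block.

```python
from collections import defaultdict


class ShuffleServiceSketch:
    """Toy model (hypothetical) of a shuffle service merging map output."""

    def __init__(self):
        # Stand-in for the NVM region: an append-only log holding values only.
        self.value_log = bytearray()
        # Stand-in for the DRAM index: partition -> list of (key, offset, length).
        self.key_index = defaultdict(list)

    def push(self, partition, key, value):
        """Accept one record from any mapper; only the key stays in the index."""
        offset = len(self.value_log)
        self.value_log += value
        self.key_index[partition].append((key, offset, len(value)))

    def flush(self, sort=True):
        """Merge records per partition and return one contiguous block each."""
        blocks = {}
        for partition, entries in self.key_index.items():
            if sort:
                # Sorting touches only the small index entries, not the values.
                entries = sorted(entries, key=lambda e: e[0])
            blocks[partition] = [
                (key, bytes(self.value_log[off:off + length]))
                for key, off, length in entries
            ]
        return blocks


service = ShuffleServiceSketch()
# Two mappers emit records destined for the same two reduce partitions.
mapper_outputs = [
    [(0, "b", b"m1-b"), (1, "x", b"m1-x")],
    [(0, "a", b"m2-a"), (1, "c", b"m2-c")],
]
for records in mapper_outputs:
    for partition, key, value in records:
        service.push(partition, key, value)

blocks = service.flush()
```

In this sketch each reducer would fetch `blocks[p]` in a single read rather than one read per mapper, which is the effect the abstract attributes to merging before the write to the distributed file system.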
Keywords/Search Tags:Big data system, Shuffle, Separation of computing and storage, Non-volatile memory