
HDFS Performance Optimization In Deep Learning Application Scenarios

Posted on: 2022-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Bao
Full Text: PDF
GTID: 2518306323479264
Subject: Control Science and Engineering
Abstract/Summary:
In recent years, the amount of data in many fields has grown at an alarming rate, and collecting, storing, and analyzing these data is essential for building efficient business solutions. Deep learning is a typical application of high-performance data analysis (HPDA). Most deep learning applications are I/O-intensive and differ from traditional applications: they exhibit random access, large data sets, extensive overlap, high aggregation, and concentrated hotspots, which place higher demands on storage systems. The Hadoop Distributed File System (HDFS), modeled on GFS, has been widely deployed in deep learning clusters. It was originally designed to aggregate large numbers of cheap storage devices into a high-capacity, high-throughput storage service, a design that cannot effectively meet the I/O requirements of today's deep learning applications. In response to this problem, this dissertation first proposes a multi-level cache architecture and establishes a mathematical model of cache costs and benefits, and then proposes three optimization strategies for deep learning applications: joint cache deployment, large-file prefetching, and small-file caching. The specific research work is as follows:

(1) Following the idea of multi-level caching, this dissertation integrates HDDs and SSDs on HDFS DataNodes to meet deep learning applications' demands on cloud storage systems for both huge capacity and high bandwidth. A utility model is established for each cache level, from which the optimal size of each level can be derived while jointly considering cost and performance.

(2) Based on theoretical analysis and real user data, this dissertation characterizes both deep learning training sets and user request patterns, then proposes three optimization strategies: joint cache deployment, large-file prefetching, and small-file caching. The joint cache deployment strategy computes the gain of caching each training set and deploys caches with a greedy algorithm; the large-file prefetching strategy pre-loads the upcoming portion of a file into memory in sequential-read scenarios; the small-file caching strategy keeps entire small files in cache so that more I/O requests can be served from memory.

To evaluate the proposed model and strategies, extensive simulations and real-system tests were carried out. The experimental results show that the cache utility model fits cache behavior well in a variety of situations, with a maximum error of only 2%. The cache hit rate of the joint cache deployment scheme is significantly better than that of LRU, and the average performance improvements of the large-file prefetching and small-file caching strategies exceed 40% and 90% respectively, achieving the optimization goals for HDFS in deep learning application scenarios.
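To make the joint cache deployment strategy concrete, the following is a minimal sketch of a gain-based greedy placement: each training set is scored by estimated caching gain per byte, and the cache is filled in descending score order. The class fields, the gain formula, and the capacity figures are illustrative assumptions; the dissertation's actual utility model is not reproduced here.

```python
# Hypothetical sketch of gain-per-byte greedy cache deployment.
from dataclasses import dataclass

@dataclass
class TrainingSet:
    name: str
    size_bytes: int          # total size of the data set
    access_rate: float       # assumed reads per epoch (workload statistic)
    speedup_per_read: float  # assumed seconds saved per cached read

    @property
    def gain(self) -> float:
        """Estimated benefit of caching the whole set (assumed utility)."""
        return self.access_rate * self.speedup_per_read

def greedy_deploy(sets: list[TrainingSet], capacity_bytes: int) -> list[TrainingSet]:
    """Pick training sets to cache, highest gain-per-byte first."""
    chosen, remaining = [], capacity_bytes
    for ts in sorted(sets, key=lambda t: t.gain / t.size_bytes, reverse=True):
        if ts.size_bytes <= remaining:
            chosen.append(ts)
            remaining -= ts.size_bytes
    return chosen

if __name__ == "__main__":
    candidates = [
        TrainingSet("imagenet-subset", 200 * 2**30, access_rate=50, speedup_per_read=0.8),
        TrainingSet("speech-corpus",   500 * 2**30, access_rate=10, speedup_per_read=0.9),
        TrainingSet("small-benchmark",  20 * 2**30, access_rate=80, speedup_per_read=0.5),
    ]
    for ts in greedy_deploy(candidates, capacity_bytes=512 * 2**30):
        print("cache:", ts.name)
```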
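The large-file prefetching strategy can likewise be illustrated with a simple read-ahead sketch: once a reader is detected to be sequential, the next chunk is fetched into memory in the background so the following read is served from RAM. The chunk size, the thread model, and the use of plain local-file I/O are assumptions for illustration; the dissertation targets HDFS DataNodes, not the local filesystem.

```python
# Hypothetical read-ahead sketch for sequential large-file reads.
import threading

CHUNK = 4 * 2**20  # assumed 4 MiB prefetch unit

class PrefetchingReader:
    def __init__(self, path: str):
        self._path = path
        self._f = open(path, "rb")
        self._next_offset = 0   # where sequential access should continue
        self._prefetched = {}   # offset -> bytes fetched ahead of time
        self._lock = threading.Lock()

    def _prefetch(self, offset: int) -> None:
        """Background task: pull the chunk at `offset` into memory."""
        with open(self._path, "rb") as f:
            f.seek(offset)
            data = f.read(CHUNK)
        with self._lock:
            self._prefetched[offset] = data

    def read_at(self, offset: int) -> bytes:
        with self._lock:
            cached = self._prefetched.pop(offset, None)
        if cached is None:                 # miss (or prefetch not done): read from disk
            self._f.seek(offset)
            cached = self._f.read(CHUNK)
        if offset == self._next_offset:    # sequential pattern detected
            self._next_offset = offset + CHUNK
            threading.Thread(target=self._prefetch,
                             args=(self._next_offset,), daemon=True).start()
        return cached
```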
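Finally, the small-file caching strategy can be sketched as a size-gated LRU cache: only files below a size threshold are admitted, so a fixed memory budget holds many files and absorbs many I/O requests. The threshold, budget, and OrderedDict-based eviction are illustrative choices, not the dissertation's exact policy.

```python
# Hypothetical small-file cache: admit only small files, evict LRU.
from collections import OrderedDict

SMALL_FILE_LIMIT = 1 * 2**20   # assumed: cache files up to 1 MiB
CACHE_BUDGET = 256 * 2**20     # assumed: 256 MiB memory budget

class SmallFileCache:
    def __init__(self):
        self._cache = OrderedDict()  # path -> file bytes, in LRU order
        self._used = 0

    def get(self, path: str) -> bytes:
        if path in self._cache:
            self._cache.move_to_end(path)      # hit: refresh LRU position
            return self._cache[path]
        with open(path, "rb") as f:            # miss: read from storage
            data = f.read()
        if len(data) <= SMALL_FILE_LIMIT:      # admit only small files
            while self._used + len(data) > CACHE_BUDGET and self._cache:
                _, evicted = self._cache.popitem(last=False)  # evict LRU entry
                self._used -= len(evicted)
            self._cache[path] = data
            self._used += len(data)
        return data
```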
Keywords/Search Tags:deep learning applications, HDFS, cache utility, cache strategy, prefetch strategy