
Research On Parallel File Systems Based On Heterogeneous Hierarchical Storage

Posted on: 2019-04-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: X Liu    Full Text: PDF
GTID: 1368330623450358    Subject: Computer Science and Technology
Abstract/Summary:
Nowadays, supercomputers are developing rapidly, and the scale of data-intensive and big data applications is expanding quickly. The growth of supercomputers and of these applications brings new and significant challenges to the HDD-based parallel file systems that are widely used in HPC environments. In HPC environments, the most widely adopted architecture separates the storage subsystem from the computing subsystem, which increases I/O latency. Because the compute nodes of supercomputers usually have no local HDDs [1,2] and it is impractical to install an SSD in every compute node, massive local I/O with bursty, aggregated characteristics flows into the globally shared parallel file system, which therefore faces huge I/O pressure. The number of CPU cores in a supercomputer has reached tens of millions [2,3], which can generate a tremendous number of aggregated I/O requests. Current research and production applications indicate that although parallel file systems with a single HDD-based storage tier provide large storage capacity, they can hardly fulfill the high-parallelism, high-bandwidth, and low-latency requirements of Exascale supercomputers. In this dissertation, based on the TH-1A supercomputer, HPC applications, and big data applications, we study the file system requirements of Exascale supercomputers and propose a new parallel file system architecture together with its key implementation techniques. The main contributions of this dissertation are as follows:

(1) We propose ONFS, a parallel file system architecture based on hierarchical hybrid storage. The HDD-based parallel file systems widely used in current supercomputers usually have only a single storage tier. Because this kind of storage server is located far from the compute nodes and is restricted by the poor performance of HDDs, the storage system provides huge storage space but only low speed and high latency. SSD-based Burst Buffer Nodes (BBN) and I/O Nodes (ION) form local file systems: they provide file read/write only to part of the compute nodes and are not integrated with the underlying HDD-based storage system. Today's typical parallel file systems can therefore hardly meet the requirements of Exascale supercomputers. By studying the I/O requirements of future Exascale applications, we propose ONFS, a globally shared, hierarchical hybrid parallel file system with DRAM-based, SSD-based, and HDD-based storage tiers. It provides parallel file read/write with high bandwidth and low latency through the DRAM-based and SSD-based tiers close to the compute nodes, and massive data storage via the HDD-based tier. Files can be migrated efficiently among the three tiers under a single namespace, and ONFS supports the POSIX standard. Compared with the commonly used parallel file systems, ONFS is the first system to achieve huge storage capacity, high parallelism, high bandwidth, and low latency simultaneously, and it can fulfill the parallel file system requirements of Exascale supercomputers.

(2) We propose distributed metadata division, storage, and management policies based on the User-group Subdirectory (UGSD). Efficient metadata storage and management, covering metadata division, storage, management, and services, is the basis for satisfying the parallel file system requirements of Exascale supercomputers. The main metadata division methods are static subtree partition, dynamic subtree partition, and hash partition. Static subtree partition has a large division granularity, and it is hard to support load balancing and dynamic scale adjustment. Although the division granularity of dynamic subtree partition is small, the relationships among subtrees are complex, so the overhead of describing and managing subtrees is large. Hash partition discards the intrinsic relationship among directories, and metadata migration is inevitable when files or directories are renamed. Based on the structure of user directories, we propose distributed metadata storage and management with the UGSD as the division unit. It preserves the intrinsic tree structure of directories and simplifies the description and management of metadata division. By appending a natural-integer suffix to each UGSD, UGSDs are evenly distributed when used as the input variable; a simple MOD function and a lookup table build the mappings between UGSD and MDS, and between MDS and MDSS. We also propose a peak-shaving MDS with a synchronous updating policy to support dynamic load balancing and scale adjustment of the metadata servers. Analysis and comparison with other typical metadata management methods show that the UGSD partitioning strategy is closely related to real usage characteristics: it divides metadata more reasonably, the resulting metadata are easy to describe and manage, the mapping algorithm is simple, the mapping between file paths and metadata servers is evenly distributed and easy to compute, and it supports dynamic metadata load balancing and scale adjustment of the metadata servers. In summary, UGSD solves the main technical issues in metadata division, storage, and management.
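The UGSD-to-MDS mapping described above can be pictured with a short sketch. The following Python fragment is purely illustrative (the dissertation does not publish its code): it assumes a UGSD name carries a natural-integer suffix, maps that suffix to an MDS index with a simple MOD function, and resolves the MDS to its MDSS through a lookup table. The server names, table contents, and helper functions are hypothetical.

```python
# Minimal sketch of the UGSD -> MDS -> MDSS mapping described in the text.
# NUM_MDS, the MDSS names, and the suffix convention are assumptions.

NUM_MDS = 4  # assumed number of metadata servers (MDS)

# Hypothetical lookup table: MDS index -> metadata storage server (MDSS)
MDS_TO_MDSS = {0: "mdss-a", 1: "mdss-b", 2: "mdss-c", 3: "mdss-d"}

def ugsd_suffix(ugsd_name: str) -> int:
    """Extract the natural-integer suffix appended to a UGSD name, e.g. 'proj017' -> 17."""
    digits = ""
    for ch in reversed(ugsd_name):
        if ch.isdigit():
            digits = ch + digits
        else:
            break
    return int(digits) if digits else 0

def locate_metadata(ugsd_name: str):
    """Map a UGSD to its MDS with a simple MOD function, then to its MDSS via the lookup table."""
    mds = ugsd_suffix(ugsd_name) % NUM_MDS
    return mds, MDS_TO_MDSS[mds]

if __name__ == "__main__":
    for name in ("proj001", "proj002", "proj017"):
        print(name, "->", locate_metadata(name))
```

Because the suffixes are assigned consecutively, the MOD function spreads UGSDs evenly across the metadata servers, which is the even-distribution property the text relies on for load balancing.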
(3) We propose memory borrow and return strategies for the DS-m storage tier, together with parallel storage control and overall performance optimization. In supercomputers, the memory of the compute nodes is dedicated to user programs, so obtaining free memory is the key to building a high-speed, low-latency storage tier based on DRAM. To the best of our knowledge, nearly all previous work on building a storage tier from compute node memory glossed over this fundamental issue. Based on the actual memory usage of user programs, we divide the compute nodes into a Full-memory Partition and a Small-memory Partition, and statically borrow a certain amount of memory from the nodes in the Small-memory Partition. According to the dynamically changing memory usage of user programs, we propose the Maximum Dynamic Following method to borrow additional remaining memory from the nodes in the Small-memory Partition. By combining static and dynamic borrow and return policies, the borrowed memory can be returned as soon as a user program requires it, which guarantees correct program execution. Our method is the first to solve the key problems of where to borrow memory and how to manage it when building a storage tier on compute node memory. Existing storage space allocation methods are disk-oriented and unsuitable for DS-m. A DS-m built on compute node memory has a small capacity, which limits the storage of large files, and the external bandwidth of a single DS-m is restricted by its interconnect interface, which limits multi-process parallel read/write bandwidth. Since DRAM is volatile, dual replicas are usually used for reliability, but updating the two replicas serially introduces high latency. The control strategies of the VFS page cache are designed for small blocks on HDDs and perform poorly for large file read/write, and the FUSE layer splits large I/O requests into multiple small requests, which lengthens data transfer. To solve these problems, we combine multiple DS-m and DS-s into a Group that works in parallel, enlarging the storage capacity of DS-m and increasing the aggregate bandwidth of multi-process read/write. We use dual replicas with parallel updating to eliminate the write latency of serial updating. We disable the VFS page cache and increase the maximum request size in FUSE to build and manage a client-side cache, which speeds up read/write operations dramatically. Experiments and analysis show that a Group of 4 DS-m provides 4 times the storage capacity and, on average, 3.4 times the read/write bandwidth of a single DS-m; parallel replica updating takes only 48.8% of the time of serial updating; and the read and write bandwidth with the client-side cache is 6.7 times and 1.78 times that with the VFS page cache, respectively.
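To illustrate why parallel replica updating removes the latency of serial updating, the sketch below issues the two replica writes concurrently, so the observed write latency approaches that of the slower replica rather than the sum of both. It is a minimal, self-contained simulation: write_replica(), the node names, and the delays are assumptions for illustration, not the ONFS implementation.

```python
# Serial vs. parallel dual-replica updating, simulated with a thread pool.
import time
from concurrent.futures import ThreadPoolExecutor

def write_replica(ds_m_node: str, data: bytes, delay: float = 0.05) -> int:
    """Pretend to push one replica of `data` to a DS-m node."""
    time.sleep(delay)          # stands in for network transfer + memory copy
    return len(data)

def update_serial(data: bytes) -> None:
    """Write the primary replica, then the backup replica (latency adds up)."""
    write_replica("ds-m-primary", data)
    write_replica("ds-m-backup", data)

def update_parallel(data: bytes) -> None:
    """Write both replicas concurrently; latency ~ the slower of the two."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(write_replica, node, data)
                   for node in ("ds-m-primary", "ds-m-backup")]
        for f in futures:
            f.result()         # wait for both replicas before acknowledging

if __name__ == "__main__":
    payload = b"x" * 4096
    t0 = time.perf_counter(); update_serial(payload)
    t1 = time.perf_counter(); update_parallel(payload)
    t2 = time.perf_counter()
    print(f"serial:   {t1 - t0:.3f}s")
    print(f"parallel: {t2 - t1:.3f}s")
```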
(4) We propose downward migration based on memory capacity thresholds and upward pre-migration based on application characteristics. Efficient, dynamic file migration control is one of the most important techniques for improving the performance of hierarchical hybrid storage systems; it includes downward migration and upward migration. Downward migration mainly uses available storage capacity as a migration condition, while upward migration uses access features, such as read/write pattern and request size, as parameters to compute data heat. Most existing solutions target low-speed HDDs and do not consider the access features of high-performance applications. Computing file heat from dynamic access characteristics is expensive, and controlling downward migration only by available storage capacity, without considering file open/close status, easily causes a ping-pong effect for open files. In this dissertation, we divide the files to be migrated downward into two categories according to their open/close status and use an LRU list to compute file coolness. We set three thresholds on available memory capacity and control downward migration with these thresholds together with file coolness. We propose whole-file and partial-file migration granularities and active/passive upward pre-migration based on the read/write and process characteristics of data-intensive applications. Experimental results and analysis show that computing file coolness is cheap, and the performance of writing files in and migrating them out is well balanced. In addition, active upward pre-migration reduces ineffective upward migration, which increases the migration benefit by 16 times or more.

We implemented an ONFS prototype on the TH-1A supercomputer through FUSE; it supports the POSIX interface, and user applications can run on ONFS without modification. IOR benchmark tests show that the read/write bandwidth of ONFS is at least 7.7 times that of the Lustre file system. Running typical data-intensive applications on both systems, the read and write bandwidth of ONFS is 5.44 times and 4.67 times that of Lustre, respectively, which is a substantial improvement.
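As a rough illustration of the downward-migration control in contribution (4), the sketch below keeps an LRU-ordered coolness list and decides how many cool files to migrate out of the memory tier according to which of three free-memory thresholds has been crossed. The threshold values, class names, and per-level migration counts are invented for illustration; only the overall scheme (three capacity thresholds combined with LRU-based file coolness) comes from the text.

```python
# Threshold-driven downward migration with an LRU "coolness" list (illustrative).
from collections import OrderedDict

# Hypothetical thresholds on free memory in the DS-m tier (fraction of total)
T_HIGH, T_MID, T_LOW = 0.30, 0.20, 0.10

class CoolnessList:
    """LRU list: the least recently accessed (coolest) files come first."""
    def __init__(self):
        self._files = OrderedDict()

    def access(self, path: str) -> None:
        self._files.pop(path, None)
        self._files[path] = True          # most recently used moves to the end

    def coolest(self, n: int):
        return list(self._files)[:n]

def plan_downward_migration(free_ratio: float, lru: CoolnessList):
    """Pick candidate files to migrate down, based on the free-memory ratio."""
    if free_ratio >= T_HIGH:
        return []                          # plenty of memory: migrate nothing
    if free_ratio >= T_MID:
        return lru.coolest(2)              # mild pressure: migrate a few cool files
    if free_ratio >= T_LOW:
        return lru.coolest(8)              # strong pressure: migrate more aggressively
    return lru.coolest(32)                 # critical: evict as many cool files as possible

if __name__ == "__main__":
    lru = CoolnessList()
    for p in ("/a", "/b", "/c", "/d"):
        lru.access(p)
    lru.access("/a")                       # /a becomes the hottest file again
    print(plan_downward_migration(0.15, lru))   # -> ['/b', '/c', '/d', '/a']
```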
Keywords/Search Tags: High Performance Computing, Storage System, File System, Hierarchical Hybrid Storage, Metadata Management, Data Migration