Research On SSD Reliability Method Optimization Based On Spatio-temporal Characteristic Of Errors In NAND Flash Memory

Posted on: 2021-02-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: S Z Wang
GTID: 1488306107456514
Subject: Computer system architecture
Abstract/Summary:
Over the past decade, SSDs (Solid State Drives) have rapidly grown in popularity in personal computers and data centers, thanks to their high internal parallelism, low random-access latency, low static power consumption, and small form factor. With advances in semiconductor process technology, multi-level cell technology, and 3D stacking, the storage density of NAND flash memory has greatly increased. However, higher storage density comes at the cost of reduced storage reliability. Unreliable storage media can lead to high overhead for maintaining data integrity, and even to permanent loss of valuable data. It is therefore important to study the reliability algorithms applied in SSDs in order to improve their efficiency and prolong SSD lifetime.

Existing reliability algorithms typically guarantee data integrity at the cost of heavy performance overhead, storage overhead, or both. For example, SSDs are configured with an Error Correction Code (ECC) engine to correct bit errors in stored data, and they use Redundant Arrays of Independent Disks (RAID) to provide system-level fault tolerance. However, the decoding latency of ECCs during error detection and correction greatly degrades read performance, and the heavy parity writes and data reconstruction operations in RAID cause high performance overhead. Moreover, to cope with the increased reliability demand that comes with higher flash storage density, SSDs use overlong ECCs, whose parity data size exceeds the configured parity space, to improve error correction capability. The read amplification caused by the extended ECC parity greatly reduces read performance.

To make full use of the multi-level parallelism in SSDs, the controller usually links blocks with the same block ID across all parallel units into superblocks and manages the flash space at superblock granularity. This maximizes system throughput and makes all blocks within the same superblock undergo the same number of Program/Erase (P/E) cycles. However, flash memory blocks differ in their wear resistance, so the weak blocks in a superblock are damaged prematurely, which accelerates SSD failure. When an SSD fails, a large number of blocks have not been fully utilized, which reduces the utilization of the SSD.

To address these problems, this dissertation carries out the following reliability optimization studies based on the spatio-temporal characteristics of errors in NAND flash memory.

To address the read performance reduction caused by high decoding latency and read amplification in traditional overlong ECC storage schemes, this dissertation first explores the temporal characteristic of errors in NAND flash memory (that is, the raw bit error rate of a block increases exponentially as P/E cycles grow). It then analyzes the relationship between the required ECC parity redundancy and the redundancy configured in NAND flash memory. There are two key observations: the configured ECCs are under-utilized during most stages of the lifetime, and the configured ECCs fail to meet reliability requirements at the end of the lifetime. A lifetime-adaptive ECC (LAE for short) based on the temporal characteristic of errors in NAND flash is proposed. The central idea of LAE is to adaptively adjust ECCs over the whole lifetime of NAND flash. In the early stages of the lifetime (that is, when blocks have undergone relatively few P/E cycles), LAE makes the best of the under-utilized configured ECC space and employs ECCs with codewords as short as possible to reduce decoding latency, thereby improving read performance. At the end of the lifetime (that is, when blocks have undergone a relatively high number of P/E cycles), LAE uses overlong ECCs, whose sizes exceed the configured ECC parity space, by extending user data to provide ECCs strong enough to ensure data integrity, which efficiently utilizes intra-SSD parallelism. Results show that LAE improves read performance by up to 85.1% compared to traditional ECCs and by up to 30.0% compared to state-of-the-art overlong ECC schemes.

The traditional RAID technology used in SSDs has two issues: high performance overhead caused by the large volume of parity writes and data rebuilding, and a risk of data loss. To address them, this dissertation first explores the relationship between the varying reliability requirements and the reliability provided by the configured RAID. It finds that the provided reliability is under-utilized during most stages of the lifetime. Inspired by the temporal characteristic of errors in NAND flash memory (blocks with the same P/E cycles suffer from different BERs), a wear-aware RAID stripe management scheme within SSDs (WARD for short), based on the temporal characteristic of errors in NAND flash memory, is proposed. On the one hand, WARD dynamically manages RAID stripes according to the real-time wear of blocks to reduce the negative effect of parity writes on performance, and it prevents more error-prone chunks than RAID can recover from remaining in a single RAID stripe, guarding against data loss. On the other hand, WARD migrates data in advance from blocks that are about to fail and leaves these blocks unused, reducing data rebuilding overhead. Results show that WARD provides high and stable reliability over the whole lifetime and improves read and write performance by up to 25.6% and 19.5%, respectively, compared with the traditional RAID scheme.

To improve the lifetime of SSDs that employ traditional superblock organization schemes, this dissertation first explores the unbalanced BER distribution across blocks and the unbalanced BER distribution across pages. A dynamic superblock management scheme based on the spatio-temporal characteristics of errors in NAND flash memory (WAS for short) is proposed. WAS uses the inter-page unbalanced BER distribution to efficiently measure block wear in real time. It dynamically organizes superblocks according to real-time block wear so that strong blocks relieve weak blocks. In addition, WAS uses a wear-based garbage collection scheme to further reduce the wear gap among blocks. Results show that WAS improves SSD utilization and lifetime by 30.78% and 51.3%, respectively, compared to the traditional superblock organization, at the cost of a negligible performance reduction.
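
The lifetime-adaptive idea behind LAE described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the exponential RBER model, its constants, the BCH-like rule of a fixed number of parity bits per correctable error, the configured correction strength, and the failure target are all hypothetical values chosen only for illustration. The sketch picks the weakest code that still meets the target at a given P/E count and reports whether that code would exceed the configured parity space.

```python
import math

# All constants below are hypothetical, chosen only for illustration.
DATA_BITS = 8192                # user data per codeword (1 KiB)
PARITY_BITS_PER_ERROR = 14      # BCH-like assumption: about 14 parity bits per correctable bit
CONFIGURED_T = 40               # correction strength that fits the configured parity space
TARGET_FAILURE_PROB = 1e-15     # acceptable probability of an uncorrectable codeword


def rber(pe_cycles, base=1e-5, growth=5e-4):
    """Hypothetical model: raw bit error rate grows exponentially with P/E cycles."""
    return base * math.exp(growth * pe_cycles)


def codeword_failure_prob(n_bits, t, p):
    """Probability of more than t bit errors among n_bits independent bits with error rate p."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(t + 1, n_bits + 1):
        # Binomial pmf evaluated in log space to avoid overflow in the coefficient.
        log_term = (math.lgamma(n_bits + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_bits - k + 1)
                    + k * log_p + (n_bits - k) * log_q)
        term = math.exp(log_term)
        total += term
        # Past the mean the terms shrink monotonically; stop once they are negligible.
        if k > n_bits * p and term < total * 1e-18:
            break
    return total


def select_ecc_strength(pe_cycles, max_t=200):
    """Return (t, overlong): the smallest strength meeting the target, and
    whether it needs more parity than the configured space provides."""
    p = rber(pe_cycles)
    for t in range(max_t + 1):
        n_bits = DATA_BITS + PARITY_BITS_PER_ERROR * t
        if codeword_failure_prob(n_bits, t, p) <= TARGET_FAILURE_PROB:
            return t, t > CONFIGURED_T
    raise ValueError("no code within max_t meets the target")


if __name__ == "__main__":
    for pe in (0, 5000, 10000):
        t, overlong = select_ecc_strength(pe)
        print(f"P/E={pe}: t={t}, overlong={overlong}")
```

Under these assumed numbers, a young block needs far less correction strength than the configured space allows, while a heavily worn block needs more than it allows, which mirrors the two observations that motivate LAE.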
Keywords/Search Tags: Solid State Drives, Reliability, Error Correction Code, Redundant Arrays of Independent Disks, Superblock, Spatio-Temporal Characteristics of Errors in NAND Flash Memory