
Research On Performance Optimization For Solid State Drives With Deduplication

Posted on: 2023-01-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M T Lu
Full Text: PDF
GTID: 1528307172452184
Subject: Computer system architecture

Abstract/Summary:
NAND flash-based solid-state drives (SSDs) are widely deployed in modern storage products and systems due to their high storage density, high access performance, and low power consumption. Despite the rapid development of SSDs, their capacity still cannot meet the storage demands of ever-growing global data. Besides, SSDs suffer from an endurance issue, and the associated reliability concern, because of their limited number of program/erase (PE) cycles. Data deduplication reduces unnecessary writes by retaining only one copy of redundant content. These potential benefits promote the deployment of deduplication in SSD-based storage systems to reduce storage costs and prolong the lifespan of SSDs. However, the conventional deduplication architecture focuses on the deduplication overhead caused by fingerprint calculation and fingerprint lookups, ignoring that the physical data layout changed by deduplication is likely to affect access patterns. For example, the uneven read distribution among the dies increases access contention, and the data fragmented by deduplication prevents the changed layout from exploiting maximal parallelism. Besides, in a multi-user environment, contention for in-memory fingerprint resources increases the deduplication overhead. To address the above problems, several novel techniques for the physical layout, the data cache, and fingerprint management are proposed as follows.

The uneven distribution of highly-duplicated data caused by conventional dynamic data allocation increases access contention. To address this issue, a read-leveling data distribution scheme (RLDDS) is proposed. Theoretical analysis and experiments show that the highly-duplicated data introduced by deduplication gathers in a few parallel units (dies), resulting in an uneven read distribution and further increasing access contention. RLDDS uses the number of address references as the read-hotness of a parallel unit; the rationale is that multiple referenced addresses raise the access probability of the related data. RLDDS attempts to distribute highly-duplicated data across the parallel units as uniformly as possible by tracking the read-hotness of each unit and choosing one with low read-hotness. In this way, RLDDS reduces access contention and thus improves read performance. Experimental results show that RLDDS improves read performance by up to 21.61% and overall system performance by up to 18.22% compared to deduplication with the conventional dynamic data allocation scheme.

Deduplication is likely to change the data layout and decrease read parallelism, resulting in an I/O fragmentation problem. To address this problem, an elastic data cache (EDC) is proposed to migrate popular fragmented data from flash into the data cache, accelerating access to it. EDC modifies the traditional logical index and designs a hybrid indexing method. Specifically, EDC uses a virtual address to index the fragmented data. This way, different read requests accessing the same data content can be served from the data cache by searching the virtual address, improving cache hits for fragmented data. Updating the cached fragmented data increases accesses to the fragmented data in flash, but read latency is significantly lower than write latency. Therefore, EDC designs a novel write-processing method that updates the hit fragmented data only if the evicted data is dirty, shortening the write response time. Extensive experiments validate the efficiency of EDC. Compared to Deduplication and Front-Dedup, EDC improves read performance by 80% and 79%, respectively, and write performance by 85% and 80%, respectively. As for overall system performance, EDC improves system throughput by 89% and 87%, respectively.

In multi-tenant scenarios, general fingerprint cache management is unaware of the users' characteristics, resulting in low efficiency of the fingerprint cache. To address this challenge, a novel fingerprint management scheme, Cost FM, is proposed. Cost FM takes a two-pronged approach to managing the fingerprint cache. The first is benefit-aware fingerprint cache allocation, which reallocates the fingerprint cache resource based on the deduplication benefits generated by each tenant. It uses dynamic programming to obtain the optimal allocation plan that maximizes the overall fingerprint cache hits, effectively avoiding the resource contention that arises when the fingerprint cache is shared and the low resource utilization that arises when it is equally partitioned. The second is a user-based policy selection scheme, which tracks access patterns and uses a decision-making model to dynamically adjust the optimal management strategy for each user. Based on these two schemes, Cost FM significantly improves the efficiency of the fingerprint cache, reducing the deduplication overhead caused by fingerprint lookups. Compared with four different deduplicated SSD systems, SHARED(LAP), SHARED(RAP), EQUAL(LAP), and EQUAL(RAP), Cost FM decreases the average request latency by 30%, 9%, 17%, and 14%, respectively. The fingerprint cache hit ratio of Cost FM is improved by 33%, 13%, 25%, and 24%, respectively. And the fingerprint writes into the flash memory are reduced by 2.8x, 6.1x, 2.8x, and 6.2x, respectively.
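The abstract gives no code, but the core RLDDS idea, tracking per-die read-hotness as the number of logical-address references and steering highly-duplicated data to the coldest die, can be sketched roughly as follows. All names here are hypothetical illustrations, not the dissertation's actual implementation.

```python
class ReadLevelingAllocator:
    """Minimal sketch of RLDDS-style die selection: read-hotness of a
    parallel unit (die) is approximated by the number of logical
    addresses referencing physical data on that die."""

    def __init__(self, num_dies):
        self.hotness = [0] * num_dies  # per-die reference counts

    def pick_die(self):
        # Choose the parallel unit with the lowest read-hotness,
        # spreading highly-duplicated data as uniformly as possible.
        return min(range(len(self.hotness)), key=lambda d: self.hotness[d])

    def record_reference(self, die, refs=1):
        # Each additional logical address that deduplicates onto data in
        # this die raises the die's expected read load.
        self.hotness[die] += refs
```

A write of a chunk with many references would call `pick_die()` for placement and `record_reference()` as further logical addresses deduplicate onto it.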
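EDC's hybrid indexing, where reads from different logical addresses that deduplicate to the same content share one cache entry keyed by a virtual address, can be illustrated with a small LRU sketch. The class and field names, and the use of plain LRU eviction, are assumptions for illustration only.

```python
from collections import OrderedDict

class ElasticDataCache:
    """Sketch of EDC-style caching: fragmented data is indexed by a
    virtual address shared across all logical addresses that
    deduplicate to the same content."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # virtual address -> data (LRU order)
        self.l2v = {}               # logical address -> virtual address

    def read(self, lba):
        vaddr = self.l2v.get(lba)
        if vaddr in self.cache:
            self.cache.move_to_end(vaddr)  # LRU touch on hit
            return self.cache[vaddr]
        return None  # miss: the request would go to flash

    def insert(self, lba, vaddr, data):
        # Map the logical address to the shared virtual address, so
        # later reads of any aliasing logical address hit this entry.
        self.l2v[lba] = vaddr
        self.cache[vaddr] = data
        self.cache.move_to_end(vaddr)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently used
```

Two logical addresses mapped to the same virtual address consume only one cache slot, which is the source of EDC's improved cache hits on fragmented data.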
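Cost FM's benefit-aware allocation, using dynamic programming to split the fingerprint cache among tenants so that overall cache hits are maximized, follows the classic resource-allocation DP pattern. The sketch below assumes each tenant exposes a hit curve (expected hits as a function of cache units granted); the function name and input format are hypothetical.

```python
def allocate_fingerprint_cache(hit_curves, total):
    """hit_curves[t][s] = expected fingerprint-cache hits for tenant t
    when granted s cache units (s = 0..total). Returns a per-tenant
    allocation summing to `total` that maximizes total expected hits."""
    n = len(hit_curves)
    best = [0.0] * (total + 1)  # best[c]: max hits over tenants seen so far with c units
    choice = [[0] * (total + 1) for _ in range(n)]
    for t in range(n):
        new = [0.0] * (total + 1)
        for c in range(total + 1):
            for s in range(c + 1):  # units granted to tenant t
                v = best[c - s] + hit_curves[t][s]
                if v > new[c]:
                    new[c] = v
                    choice[t][c] = s
        best = new
    # Backtrack to recover the optimal per-tenant split.
    alloc, c = [0] * n, total
    for t in range(n - 1, -1, -1):
        alloc[t] = choice[t][c]
        c -= alloc[t]
    return alloc
```

Rerunning this allocation as tenants' measured hit curves drift would realize the reallocation behavior the abstract describes, avoiding both shared-cache contention and the waste of a static equal partition.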
Keywords/Search Tags:Solid State Drive, Data Deduplication, Access Contention, Fragmentation, Fingerprint Management