
Research on High-Performance Duplicate Detection and Elimination

Posted on: 2013-11-15    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J S Wei    Full Text: PDF
GTID: 1228330392455477    Subject: Computer system architecture

Abstract/Summary:
As the world moves into the digital universe, the era of big data has arrived. Data deduplication has attracted considerable recent interest in the research community, and it plays a vital role today in improving the space efficiency of disk-based storage systems and affordably managing explosively growing data. Data deduplication is the process of identifying and eliminating duplicate content in a data set or byte stream to improve the efficiency of data storage and/or transmission. In general, deduplication can be carried out at one of four levels of granularity: whole files, byte strings, fixed-size blocks, or variable-sized chunks. To enhance the performance of deduplication storage systems, it is important to study how to efficiently define, detect, and eliminate duplicate data objects.

To meet the high-throughput requirement of modern dual-level deduplication approaches, a pipelined dual-level fingerprinting (PDF) scheme is proposed that can fingerprint a dataset both at the file level and at the chunk level in a single scan of its contents. PDF breaks the fingerprinting process into task segments and leverages the computing resources of modern multi-core CPUs to pipeline the time-consuming operations. Experimental results show that PDF achieves significantly higher throughput than traditional multi-threaded approaches.

Detecting duplicates over sliding windows is an important technique for monitoring and analysing data streams. Since recording exact information about every element in a sliding window can be RAM-intensive and introduce unacceptable search complexity, several approximate membership representation schemes have been proposed to build fast in-memory indices; however, challenges in RAM utilization and scalability remain. A Detached Counting Bloom filter Array (DCBA) is proposed to flexibly and efficiently detect duplicates over sliding windows. A DCBA consists of an array of detached counting Bloom filters (DCBFs), where each DCBF is essentially a Bloom filter associated with a detached timer (counter) array. The DCBA scheme functions as a circular FIFO queue, keeping a filling DCBF to accommodate fresh elements and a decaying DCBF to evict stale elements. DCBA allows the timer arrays belonging to fully filled DCBFs to be offloaded to disk, greatly improving memory space efficiency. Fully filled DCBFs remain stable until their elements become stale, which allows a DCBA to be efficiently replicated for data reliability or information sharing. Further, a DCBA can be cooperatively maintained by clustered nodes, providing a scalable solution for mining massive data streams. Mathematical analysis and experimental results show that DCBA significantly outperforms existing schemes in memory efficiency and scalability.
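To make the sliding-window idea concrete, the following is a minimal Python sketch of duplicate detection over a circular array of Bloom filters. It is an illustration under simplifying assumptions rather than the dissertation's DCBA implementation: plain Bloom filters stand in for counting Bloom filters with detached timer arrays, stale elements are evicted by discarding the oldest filter wholesale, and the class and method names (BloomFilter, SlidingWindowDetector, seen) are invented for the example.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-capacity Bloom filter using double hashing (illustrative helper)."""

    def __init__(self, capacity, error_rate=0.01):
        self.capacity = capacity
        self.count = 0
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(round(self.m / capacity * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[idx // 8] >> (idx % 8) & 1 for idx in self._indexes(item))


class SlidingWindowDetector:
    """Circular FIFO of Bloom filters that together cover the last `window` elements."""

    def __init__(self, window, segments=4, error_rate=0.01):
        self.segment_capacity = window // segments
        self.max_segments = segments
        self.error_rate = error_rate
        # The last filter in the list is the "filling" one; the first is the
        # "decaying" one, dropped once the window slides past its elements.
        self.filters = [BloomFilter(self.segment_capacity, error_rate)]

    def seen(self, item):
        """Return True if `item` probably occurred within the window, then record it."""
        duplicate = any(item in bf for bf in self.filters)
        if self.filters[-1].count >= self.segment_capacity:
            # Filling filter is full: open a fresh one and evict the stale segment.
            self.filters.append(BloomFilter(self.segment_capacity, self.error_rate))
            if len(self.filters) > self.max_segments:
                self.filters.pop(0)
        self.filters[-1].add(item)
        return duplicate


detector = SlidingWindowDetector(window=100_000)
print(detector.seen(b"chunk-fingerprint-1"))  # False: first occurrence
print(detector.seen(b"chunk-fingerprint-1"))  # True: duplicate within the window
```

In this sketch the filling filter corresponds to the newest segment of the window and the decaying filter to the oldest; a real DCBA additionally ages individual elements with the detached timers, so eviction need not wait for an entire segment to expire.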
Membership representation and determination is important for storage systems that manage very large data sets. A Dynamic Bloom filter Array (DBA) is proposed to represent membership for large data sets of variable size in a scalable way. DBA consists of dynamically created groups of space-efficient Bloom filters (BFs) to accommodate changes in set size. Within a group, BFs are homogeneous, and the data layout is optimized at the bit level to enable parallel access and thus achieve high query performance. DBA can effectively control its query accuracy by partially adjusting the error rate of the constituent BFs, where each BF represents an independent subset to help locate elements and confirm membership. Further, DBA supports element deletion through a lazy update policy. The DBA scheme is prototyped and evaluated as a scalable fast index in the MAD2 deduplication storage system; theoretical analysis and experimental evaluation show that DBA excels in performance, scalability, query accuracy, and space efficiency.

Two challenges face scalable high-throughput deduplication storage. The first is the duplicate-lookup disk bottleneck: the data index usually exceeds the available RAM space, which limits deduplication throughput. The second is the storage-node island effect, in which duplicate data spread among multiple storage nodes is difficult to eliminate. Existing approaches fail to completely eliminate duplicates while simultaneously addressing both challenges. Therefore, a scalable high-throughput exact deduplication approach called MAD2 is proposed to improve the storage efficiency of network backup systems. MAD2 eliminates duplicate data both at the file level and at the chunk level by employing four techniques to accelerate the deduplication process and evenly distribute data. First, MAD2 organizes fingerprints into a Hash Bucket Matrix (HBM), whose rows preserve the data locality of backups. Second, MAD2 uses a Dynamic Bloom filter Array (DBA) as a quick index to identify non-duplicate incoming data objects or to indicate where a possible duplicate may be found. Third, a Dual Cache is integrated into MAD2 to effectively capture and exploit data locality. Finally, MAD2 employs a DHT-based load-balancing technique to evenly distribute data objects among multiple storage nodes along their backup sequences, further enhancing performance with a well-balanced load.
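The following sketch illustrates, again in Python and again only as an assumption-laden illustration, how a dynamically growing array of Bloom filters can serve as the quick in-memory check on a fingerprint lookup path, in the spirit of DBA and its use in MAD2. The BloomFilter helper repeats the one from the earlier sketch; DynamicBloomFilterArray, deduplicate, and the on_disk_lookup stub are invented names standing in for MAD2's actual Hash Bucket Matrix, Dual Cache, and index interfaces.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-capacity Bloom filter using double hashing (same helper as above)."""

    def __init__(self, capacity, error_rate=0.01):
        self.capacity = capacity
        self.count = 0
        self.m = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(round(self.m / capacity * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[idx // 8] >> (idx % 8) & 1 for idx in self._indexes(item))


class DynamicBloomFilterArray:
    """Array of Bloom filters that grows as the represented set grows."""

    def __init__(self, capacity_per_filter=1_000_000, error_rate=0.01):
        self.capacity = capacity_per_filter
        self.error_rate = error_rate
        self.filters = [BloomFilter(capacity_per_filter, error_rate)]

    def add(self, fingerprint):
        if self.filters[-1].count >= self.capacity:
            # The active filter is full: allocate a new one rather than
            # over-filling (and degrading) the existing filters.
            self.filters.append(BloomFilter(self.capacity, self.error_rate))
        self.filters[-1].add(fingerprint)

    def might_contain(self, fingerprint):
        # Each filter covers an independent subset of fingerprints; a hit in
        # any of them means the fingerprint may already be stored.
        return any(fingerprint in bf for bf in self.filters)


def deduplicate(chunk_data, index, on_disk_lookup):
    """Decide whether a chunk must be stored, consulting the quick index first."""
    fingerprint = hashlib.sha1(chunk_data).digest()
    if not index.might_contain(fingerprint):
        index.add(fingerprint)          # definitely new: no disk access needed
        return "store"
    if on_disk_lookup(fingerprint):     # possible duplicate: confirm against disk
        return "duplicate"
    index.add(fingerprint)              # quick-index false positive
    return "store"


index = DynamicBloomFilterArray()
print(deduplicate(b"some chunk bytes", index, on_disk_lookup=lambda fp: False))
```

The property mirrored here is that a miss in every filter proves the fingerprint is new, so the expensive on-disk lookup is paid only when the quick index reports a possible duplicate.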
Keywords/Search Tags:Data Deduplication, Variable-Sized Chunking, Sliding Window Model, Duplicate Detection, Fast Index, Data Backup