Research On Key Technologies Of Data Deduplication For Cloud Environment

Posted on: 2014-11-22  Degree: Doctor  Type: Dissertation
Country: China  Candidate: Y J Fu
GTID: 1268330422474292  Subject: Computer Science and Technology
Abstract/Summary:
With the arrival of the Big Data era, the volume of data in the information world is growing at an explosive rate, and the datasets that data centers must store and manage can easily reach petabytes or even exabytes. Previous studies have shown that a large amount of redundant data exists in the massive datasets of both backup/archive storage and primary storage. Traditional data backup techniques and virtual machine image management magnify this duplication by storing redundant data over and over again. To restrain excessive data growth, improve IT resource utilization, reduce system power consumption, and save management cost, data deduplication, as a novel data reduction technology, has become a hot topic in academia and industry.

As a key technology supporting Big Data, cloud computing optimizes resource efficiency through network computing and virtualization techniques, providing cost-effective, highly efficient, and reliable computing and storage services for users. In cloud backup and virtual desktop cloud environments, deduplication can significantly reduce storage space requirements and improve network bandwidth efficiency because of the high data redundancy in these services, but it also brings new challenges for system performance. This thesis discusses how to apply deduplication to optimize the cloud backup service in personal computing environments, the distributed cloud backup storage system in data centers, and the cluster storage system in virtual desktop clouds, so that storage space efficiency and system scalability are significantly improved while the negative impact of the deduplication process on I/O performance becomes negligible. Building on a deep understanding of current cloud computing technology, we study deduplication-based cloud backup, Big Data backup, and virtual desktop cloud, and propose creative system designs and novel algorithms.
In summary, the main contributions and innovations of this thesis are as follows:

(1) Proposes ALG-Dedupe, an application-aware local and global source deduplication scheme for cloud backup services in personal computing environments. By conducting a content overlap analysis on massive personal datasets, this thesis first discovers that the amount of data shared among different types of applications is negligibly small. Based on a semantic classification of applications, an application-aware index structure is designed to improve deduplication efficiency by eliminating redundancy within each application independently and in parallel. The scheme also reduces computational overhead by employing an intelligent data chunking scheme and an adaptive use of hash functions based on application awareness. To balance network latency against system overhead on personal devices, ALG-Dedupe combines client-side local redundancy detection with cloud-side global redundancy detection to improve the data reduction ratio and reduce deduplication time. The experimental results show that ALG-Dedupe can significantly improve deduplication efficiency, shorten the backup window, save cloud cost, and reduce power consumption and system overhead on personal computing devices.

(2) Designs Σ-Dedupe, a scalable inline cluster deduplication method for Big Data backup. The novelty of our study lies in exploiting both locality and similarity in backup data streams to optimize cluster deduplication. It combines inter-node data routing at the super-chunk level with intra-node deduplication at the chunk level to improve the data reduction ratio and preserve data locality in each node. Inspired by the generalization of Broder's Theorem, Σ-Dedupe is the first application of handprinting in the context of cluster deduplication to improve similarity detection.
After discounting the super-chunk resemblances by the storage usage of the nodes, the handprint-based stateful data routing algorithm assigns data from backup clients to deduplication server nodes at the super-chunk level. Σ-Dedupe builds a similarity index of super-chunk handprints over the traditional container-based, locality-preserving chunk-fingerprint caching scheme to alleviate the disk bottleneck of chunk index lookup. By performing source deduplication, backup clients avoid transferring duplicate data chunks to the target deduplication server node over the network. Finally, we conduct extensive experiments to show that Σ-Dedupe maintains a high cluster-wide data reduction ratio and reduces system communication overhead and memory cost, with a balanced workload across nodes.

(3) Proposes a cluster-deduplication-based virtual desktop storage optimization technique. To support virtual desktop cloud services, a virtual desktop server cluster is needed to manage a large number of desktop virtual machines. This thesis is the first research work to provide a virtual machine scheduling algorithm that optimizes deduplication-based virtual desktop storage by leveraging semantic awareness in virtual machine images. Meanwhile, it combines a server-side chunk cache with a local hybrid storage cache to improve the I/O performance of deduplication-based virtual desktop storage. The experiments show that our optimization improves the space efficiency of virtual desktop storage, reduces I/O operations, and enhances virtual desktop start-up performance.

By studying the above key techniques of deduplication in cloud environments, we provide strong technical support for the future of cloud storage and cloud computing.
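The handprint-based stateful routing described in contribution (2) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the exact formulas, so the sketch assumes that a handprint is the k smallest chunk fingerprints of a super-chunk, that resemblance is the number of handprint entries already present in a node's similarity index, and that the discount divides resemblance by the node's storage usage relative to the cluster average. The names `Node`, `handprint`, and `route`, the use of SHA-1, and the default `k=4` are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Set


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint; SHA-1 is an assumed choice for this sketch."""
    return hashlib.sha1(chunk).hexdigest()


def handprint(chunks: List[bytes], k: int = 4) -> List[str]:
    """Assumed definition: the k smallest distinct chunk fingerprints."""
    return sorted({fingerprint(c) for c in chunks})[:k]


@dataclass
class Node:
    """Hypothetical deduplication server node."""
    name: str
    similarity_index: Set[str] = field(default_factory=set)
    chunk_store: Set[str] = field(default_factory=set)
    stored_bytes: int = 0

    def resemblance(self, hp: List[str]) -> int:
        """Count handprint entries already known to this node."""
        return sum(1 for fp in hp if fp in self.similarity_index)

    def store(self, chunks: List[bytes], hp: List[str]) -> int:
        """Intra-node chunk-level deduplication; returns new bytes written."""
        self.similarity_index.update(hp)
        new_bytes = 0
        for c in chunks:
            fp = fingerprint(c)
            if fp not in self.chunk_store:
                self.chunk_store.add(fp)
                new_bytes += len(c)
        self.stored_bytes += new_bytes
        return new_bytes


def route(chunks: List[bytes], nodes: List[Node], k: int = 4) -> Node:
    """Pick the node with the highest usage-discounted resemblance.

    Ties (for example, when no node has seen the data) go to the node
    holding the fewest bytes, which keeps the workload balanced.
    """
    hp = handprint(chunks, k)
    avg = sum(n.stored_bytes for n in nodes) / len(nodes)

    def score(n: Node) -> float:
        # +1 smoothing avoids division by zero on empty nodes (assumption).
        relative_usage = (n.stored_bytes + 1) / (avg + 1)
        return n.resemblance(hp) / relative_usage

    return max(nodes, key=lambda n: (score(n), -n.stored_bytes))
```

In use, a backup client would compute the handprint of each super-chunk, call `route` to select the target node, and then call `store` on that node so that its similarity index and usage counter reflect the new data before the next routing decision. In a real cluster the resemblance query would be a network request to each candidate node rather than a local method call.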
Keywords/Search Tags: Data Deduplication, Cloud Backup, Virtual Desktop Cloud, Index Lookup, Data Routing, Virtual Machine Scheduling