
Research On Data De-duplication Technology In Network Backup

Posted on: 2011-02-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T M Yang
Full Text: PDF
GTID: 1118330332968061
Subject: Computer system architecture
Abstract/Summary:
Today, the ever-growing volume and value of digital information have created a critical and mounting demand for large-scale, high-performance data protection. The mass of data that must be backed up and archived already amounts to several petabytes and may soon reach tens or even hundreds of petabytes. Despite this explosive growth, research shows that a large amount of duplicate data exists throughout information processing and storage, for example in file systems, e-mail attachments, web objects, operating systems and application software. Traditional data protection technologies such as periodic backup, versioning file systems, snapshots and continuous data protection magnify this duplication by storing the same redundant data over and over again. Because of the unnecessary data movement, enterprises often face backup windows that spill into production hours, network constraints, and too much storage under management. To restrain the excessive growth of data, improve resource utilization and reduce costs, data de-duplication has become a hot research topic.

Given the continued growth of data and the high continuity requirements of applications, it is essential that a large-scale network backup system maintain good performance and scalability while performing data de-duplication to improve storage space efficiency. Our work therefore focuses on the performance and scalability of data de-duplication. We present a distributed, hierarchical de-duplication architecture based on centralized management, and study its metadata management, index maintenance, and scalable, high-performance de-duplication techniques in detail. The main contributions of this dissertation are as follows:

To overcome the shortcomings of existing de-duplication solutions, which achieve high backup performance but scale poorly in large, distributed backup environments because of their single-server architecture, we present a distributed, hierarchical de-duplication architecture based on centralized management for network backup. The architecture supports a cluster of backup servers that perform de-duplication in parallel, and it uses a master server that handles job scheduling, metadata management and load balancing to improve the system's scalability. The data stream is transferred directly from the client to a backup server, de-duplicated in batches and then sent to the back-end storage nodes, which effectively separates the control flow from the data flow. Multi-layer data indexing supports high-performance hierarchical de-duplication and dynamic expansion of both backup servers and storage nodes, giving the system good performance, manageability and scalability.
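As a rough illustration of the control/data separation in this architecture, the following Python sketch uses hypothetical class and method names (MasterServer, BackupServer, StorageNode, schedule, run_job); it is a minimal sketch of the idea under assumed interfaces, not the dissertation's implementation. The master only schedules jobs and balances load, while backup servers receive data directly from clients, de-duplicate it and forward it to storage nodes:

```python
import hashlib

class StorageNode:
    """Back-end storage node; holds de-duplicated chunks."""
    def __init__(self):
        self.chunks = {}

    def store(self, fp, chunk):
        self.chunks.setdefault(fp, chunk)

class BackupServer:
    """Performs de-duplication in parallel with its peer servers."""
    def __init__(self, storage_nodes):
        self.storage_nodes = storage_nodes
        self.jobs = []

    def load(self):
        return len(self.jobs)

    def run_job(self, job, chunks):
        # Data flows client -> backup server -> storage nodes and never
        # through the master, separating control flow from data flow.
        self.jobs.append(job)
        for chunk in chunks:
            fp = hashlib.sha1(chunk).digest()
            # Simplified placement; the dissertation routes fixed-size
            # containers with a stateless algorithm instead.
            node = self.storage_nodes[fp[0] % len(self.storage_nodes)]
            node.store(fp, chunk)

class MasterServer:
    """Centralized management: job scheduling, metadata and load balancing."""
    def __init__(self, backup_servers):
        self.backup_servers = backup_servers

    def schedule(self, job):
        # Only small control messages pass through the master.
        return min(self.backup_servers, key=lambda s: s.load())
```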
Existing de-duplication technologies look up fingerprints across the whole system in order to eliminate duplicates while writing data to back-end storage. As the amount of data grows, the memory overhead required to accelerate fingerprint lookup grows with it, so the system's physical capacity is inevitably limited by the amount of physical memory the server can provide. This dissertation implements an in-memory fingerprint filter based on small-scale detection, deployed in the backup process to eliminate the duplicates generated by periodic backups. The filter limits fingerprint lookup to the scope of the job chain, so its memory overhead is independent of the system scale. In addition, it collects fingerprints during backup, which enables high-performance post-processing de-duplication and thus avoids the impact of time-consuming disk index accesses on the application system. Experiments show that the filter eliminates most of the duplicates in a backup, which improves overall system performance by reducing both the bandwidth required for backups and the number of chunks that need further processing in the background.

Data that has passed through the in-memory fingerprint filter is further de-duplicated in the background by a post-processing de-duplication algorithm. The algorithm processes a large number of fingerprints in a single pass over the disk index, effectively eliminating the disk-index access bottleneck. In addition, it preserves the logical order of new chunks by storing them in fixed-size containers, which enables high-performance data recovery. Containers are distributed to back-end storage nodes by a stateless routing algorithm, which supports load balancing, data migration and dynamic expansion of the back-end storage. Experiments show that, compared with current mainstream de-duplication technology, the algorithm supports a larger physical system capacity for the same memory overhead; more importantly, it allows multiple servers to perform de-duplication storage in parallel, making it applicable to large-scale, distributed environments.

The post-processing de-duplication algorithm speeds up fingerprint lookup by performing a sequential scan over the entire disk index, so it is desirable to improve disk index utilization and thereby allow a smaller disk index for a given system scale. Since we found no prior work on disk index utilization in the literature, we implemented the disk index as a disk-resident hash table based on prefix mapping and studied its utilization through both theoretical analysis and extensive experiments. The results show that with appropriately sized buckets, disk index utilization can be effectively improved at an acceptable CPU cost, which in turn not only reduces metadata storage but also improves index scan performance.
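A minimal sketch of the in-memory fingerprint filter described above, assuming SHA-1 fingerprints and a hypothetical JobChainFilter class (the dissertation's actual data structures are not reproduced here): lookups are confined to the fingerprints of one job chain, and every fingerprint is collected for the later post-processing pass.

```python
import hashlib

class JobChainFilter:
    """In-memory fingerprint filter scoped to a single backup job chain.

    Only fingerprints from earlier backups of the same job are held in
    memory, so the filter's size does not grow with the whole system.
    """

    def __init__(self, known_fingerprints=()):
        self.known = set(known_fingerprints)   # fingerprints from this job chain
        self.collected = []                    # gathered for post-processing

    def is_new(self, chunk: bytes) -> bool:
        """Return True if the chunk has not been seen in this job chain."""
        fp = hashlib.sha1(chunk).digest()
        self.collected.append(fp)              # always collect for the background pass
        if fp in self.known:
            return False                       # duplicate from a previous periodic backup
        self.known.add(fp)
        return True                            # new to the job chain; send to the backup server
```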
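The post-processing step can be pictured as a batch merge against the disk index followed by container packing and stateless routing. This is only a sketch under assumed interfaces: `index_scan` yielding stored fingerprints in sorted order, `chunk_store` holding chunk data staged during backup, and `write_container` on the storage nodes are illustrative names, and the container size and modulo routing rule are simplifications, not the dissertation's exact scheme.

```python
CHUNKS_PER_CONTAINER = 1024        # stand-in for the fixed container size

def route_container(container_id: int, num_nodes: int) -> int:
    """Stateless routing: the target node is computed from the id alone."""
    return container_id % num_nodes

def new_fingerprints(batch, index_scan):
    """Check a whole batch against a single sequential scan of the disk index."""
    new, stored = set(), next(index_scan, None)
    for fp in sorted(set(batch)):              # sorted batch merges against the sorted scan
        while stored is not None and stored < fp:
            stored = next(index_scan, None)
        if stored != fp:
            new.add(fp)                        # not yet stored anywhere in the system
    return new

def flush_containers(batch, new, chunk_store, storage_nodes, first_cid=0):
    """Pack new chunks into containers in logical (stream) order and route them."""
    cid, container, pending = first_cid, [], set(new)
    for fp in batch:                           # batch preserves the backup stream order
        if fp not in pending:
            continue
        pending.discard(fp)
        container.append((fp, chunk_store[fp]))
        if len(container) == CHUNKS_PER_CONTAINER:
            storage_nodes[route_container(cid, len(storage_nodes))].write_container(cid, container)
            cid, container = cid + 1, []
    if container:                              # flush the last, partially filled container
        storage_nodes[route_container(cid, len(storage_nodes))].write_container(cid, container)
```

Because the routing decision depends only on the container id and the number of nodes, any backup server can compute it without shared routing state, which is what allows parallel de-duplication storage across servers.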
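Finally, the disk index itself can be pictured as a disk-resident hash table keyed by a fingerprint prefix. The prefix length, entry layout and bucket size below are illustrative assumptions made for this sketch; the dissertation's analysis concerns how the choice of bucket size affects utilization and scan performance.

```python
PREFIX_BITS = 20                    # assumed: 2^20 buckets selected by a fingerprint prefix
FP_SIZE = 20                        # assumed 20-byte SHA-1 fingerprints
ENTRY_SIZE = FP_SIZE + 8            # fingerprint plus an 8-byte container id
ENTRIES_PER_BUCKET = 64             # "appropriately sized" buckets trade CPU for utilization
BUCKET_SIZE = ENTRY_SIZE * ENTRIES_PER_BUCKET

def bucket_of(fp: bytes) -> int:
    """Prefix mapping: the leading PREFIX_BITS of the fingerprint pick the bucket."""
    return int.from_bytes(fp[:4], "big") >> (32 - PREFIX_BITS)

def lookup(index_file, fp: bytes):
    """Read the one on-disk bucket the prefix maps to and search it linearly."""
    index_file.seek(bucket_of(fp) * BUCKET_SIZE)
    bucket = index_file.read(BUCKET_SIZE)
    for off in range(0, len(bucket), ENTRY_SIZE):
        entry = bucket[off:off + ENTRY_SIZE]
        if entry[:FP_SIZE] == fp:
            return int.from_bytes(entry[FP_SIZE:], "big")   # the chunk's container id
    return None                                             # fingerprint not in the index
```

Larger buckets absorb uneven prefix distributions and so raise utilization, at the cost of more bytes read and compared per lookup, which matches the trade-off between utilization and CPU overhead discussed above.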
Keywords/Search Tags: Data backup, De-duplication, Disk index, Fingerprint lookup, Index update, Post-processing