
Research On Data Deduplication Techniques

Posted on: 2014-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: J R Zhang
Full Text: PDF
GTID: 2248330398460149
Subject: Computer software and theory

Abstract:
Cloud storage has become increasingly prevalent owing to its scalability, convenience, and cost-effectiveness. It forms the basis for a range of file hosting services that allow users to store files on remote servers and synchronize them between the servers and their devices. In this setting the physical storage facilities behind the service are usually highly centralized, so data de-duplication (also called duplicate detection) techniques are essential for coping with the massive amount of replicated data.

Traditional data de-duplication methods concentrate solely on reducing storage space consumption; however, in a remote storage system the network overhead cannot be disregarded either, especially when the system operates over a WAN. We propose a new duplicate detection algorithm that revises the data index and introduces a new duplicate-block matching method, and we implement a network file system prototype called DDSN to evaluate its performance. The new method matches the storage space consumption of the sliding blocking method, which achieves the lowest space consumption among known duplicate detection algorithms, while overcoming its drawback that the whole file must be transmitted over the network, and thus saves substantial bandwidth on duplicate data. In addition, a new file structure is developed to organize data in fixed-size blocks. (A sketch of the block-matching idea appears below.)

Beyond storage efficiency, naive file synchronization requires the whole file to be transmitted to all other locations (devices and servers) whenever it is modified in one place, which wastes network bandwidth and delays the propagation of changes. We design a new method called HadoopRsync that performs incremental updates rather than transmitting files in their entirety. It builds on Rsync, which was originally designed for synchronization between two machines, but the cloud storage setting differs substantially in that files are stored in a distributed fashion across multiple nodes of the cloud. We therefore propose a pair of methods, HadoopRsync Download and HadoopRsync Upload, responsible for synchronization from the cloud to users' devices and in the opposite direction, respectively. Both send only the differences between the new and old versions of a file rather than the entire file. The algorithms are built on the open-source Hadoop framework, which processes very large data sets in a distributed manner across clusters of computers.
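The abstract does not reproduce DDSN's revised index or HadoopRsync's code, so the following is only a minimal sketch of the generic rsync-style block matching that both ideas build on: a weak checksum filters candidate blocks, a strong hash confirms matches, and unmatched bytes are sent as literals. The block size, function names, and the recompute-per-window weak checksum are illustrative assumptions, not the thesis's implementation.

```python
# Illustrative sketch of rsync-style block matching (not the thesis's DDSN/HadoopRsync code).
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size


def weak_checksum(data):
    """Adler-32-style weak checksum (recomputed per window here instead of rolled, for brevity)."""
    a = sum(data) % 65521
    b = sum((len(data) - i) * byte for i, byte in enumerate(data)) % 65521
    return (b << 16) | a


def strong_hash(data):
    """Strong hash used to confirm a weak-checksum match."""
    return hashlib.sha1(data).hexdigest()


def block_signatures(old_data):
    """Signatures of the already-stored file: weak checksum -> {strong hash: block offset}."""
    sigs = {}
    for offset in range(0, len(old_data), BLOCK_SIZE):
        block = old_data[offset:offset + BLOCK_SIZE]
        sigs.setdefault(weak_checksum(block), {})[strong_hash(block)] = offset
    return sigs


def delta(new_data, sigs):
    """Slide a window over the new file; emit block references for duplicate
    blocks and literal bytes for everything that has no match."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new_data):
        window = new_data[i:i + BLOCK_SIZE]
        match = sigs.get(weak_checksum(window), {}).get(strong_hash(window))
        if match is not None and len(window) == BLOCK_SIZE:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))   # reference to an existing block
            i += BLOCK_SIZE
        else:
            literal.append(new_data[i])   # no match: this byte must be sent verbatim
            i += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops
```

Applying the resulting operation list on the receiving side rebuilds the new file from already-stored blocks plus the literal bytes, so only the literals need to cross the network; this is the property that both the duplicate-block matching in DDSN and the incremental updates in HadoopRsync exploit.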
Our solution uses the Hadoop MapReduce facility to take full advantage of its massive parallelism. Moreover, we present several optimizations that reduce the Hadoop file system I/O required by file updates. Extensive experiments are conducted to evaluate the proposed methods; the results show that HadoopRsync significantly outperforms the baseline algorithms.
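The abstract does not specify the layout of the MapReduce jobs, so the sketch below is a hypothetical Hadoop Streaming example of the kind of parallel work involved: mappers compute block checksums for their share of the input and a reducer groups identical checksums to expose duplicates. The input format (file_id, block index, base64-encoded block bytes) and the single-script structure are assumptions made for illustration.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming job for parallel block-checksum computation.
import base64
import hashlib
import sys


def mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Emit '<checksum>\t<file_id>:<block_index>' for each input block record."""
    for line in stdin:
        file_id, block_index, payload = line.rstrip("\n").split("\t")
        digest = hashlib.sha1(base64.b64decode(payload)).hexdigest()
        stdout.write(f"{digest}\t{file_id}:{block_index}\n")


def reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Group blocks by checksum (Hadoop sorts by key before the reduce phase);
    a checksum seen more than once marks duplicate blocks to store only once."""
    current, locations = None, []
    for line in stdin:
        digest, location = line.rstrip("\n").split("\t")
        if digest != current:
            if current is not None and len(locations) > 1:
                stdout.write(f"{current}\t{','.join(locations)}\n")
            current, locations = digest, []
        locations.append(location)
    if current is not None and len(locations) > 1:
        stdout.write(f"{current}\t{','.join(locations)}\n")


if __name__ == "__main__":
    # Run as "job.py map" for the map phase and "job.py reduce" for the reduce
    # phase (the script name and argument convention are illustrative).
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)()
```

With Hadoop Streaming such a script can be passed via the -mapper and -reducer options (invoked as `job.py map` and `job.py reduce`), so the checksum computation is spread across the cluster; the thesis's actual jobs and HDFS I/O optimizations may differ.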
Keywords: data de-duplication, duplicate detection, network file system, rsync, Hadoop, cloud storage