
Research Of Data Deduplication Technology On Hadoop Distributed System

Posted on: 2016-10-19, Degree: Master, Type: Thesis
Country: China, Candidate: S H Yu, Full Text: PDF
GTID: 2298330467479683, Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, the volume of data being produced and copied is growing at a remarkable pace. This huge amount of data requires more storage capacity, processing power, and network bandwidth, and more and more of it is stored on cloud servers. However, large amounts of duplicated data not only occupy storage space but also reduce storage efficiency. Data deduplication technology offers a good solution to this problem: it optimizes storage and reduces wasted physical storage space, helping to meet the growing demand for data storage.

However, traditional data deduplication incurs additional cost and redundancy, which reduces performance. As the amount of data increases, fingerprint retrieval slows down, which in turn degrades data storage performance. To solve these problems, this thesis develops and optimizes chunk-level deduplication technology. The approach can improve storage efficiency in both space and time, and can be applied to the Hadoop distributed platform.

First, this thesis surveys data deduplication technology and its application in distributed systems, and analyzes its features and current state. On this basis, the thesis improves the content-defined chunking algorithm, proposing a new incremental fingerprint algorithm, DRabin, and an improved algorithm, TDOB, based on the TTTD algorithm, to raise the chunking speed and the deduplication ratio, respectively. Next, these algorithms are applied to the Hadoop distributed system: a deduplication system is designed and built on Hadoop, and its performance is optimized. Finally, numerical experiment results show that the above algorithms improve storage performance significantly.
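
To make the idea of chunk-level deduplication with content-defined chunking more concrete, the following is a minimal Python sketch. It is not the thesis's DRabin or TDOB algorithm, whose details are not given in this abstract. It swaps in a generic polynomial rolling hash (in the spirit of Rabin fingerprinting) and simple minimum and maximum chunk-size thresholds (loosely in the spirit of TTTD). The constants `WINDOW`, `BASE`, `MOD`, `MIN_SIZE`, `MAX_SIZE`, and `MASK`, the function `chunk_boundaries`, and the class `DedupStore` are all hypothetical names chosen for illustration, and the in-memory dictionary stands in for a real fingerprint index and HDFS storage.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the thesis.
WINDOW = 16              # bytes covered by the rolling hash window
BASE = 257               # polynomial base for the rolling hash
MOD = (1 << 31) - 1      # modulus for the rolling hash
MIN_SIZE = 64            # smallest chunk a content-defined cut may produce
MAX_SIZE = 1024          # forced cut if no content-defined boundary appears
MASK = 0xFF              # cut when the low 8 bits of the hash are all ones


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks in data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    start = 0
    h = 0
    for i in range(len(data)):
        length = i - start + 1  # chunk length including byte i
        if length > WINDOW:
            # Slide the window: drop the byte that falls out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + data[i]) % MOD
        at_boundary = length >= MIN_SIZE and (h & MASK) == MASK
        if at_boundary or length >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0  # restart the hash for the next chunk
    if start < len(data):
        yield start, len(data)  # trailing chunk


class DedupStore:
    """Toy in-memory store: each unique chunk is kept once, keyed by SHA-1."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for s, e in chunk_boundaries(data):
            piece = data[s:e]
            fp = hashlib.sha1(piece).hexdigest()
            self.chunks.setdefault(fp, piece)  # store only if unseen
            recipe.append(fp)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore()
    payload = bytes(range(256)) * 40
    store.put("original", payload)
    store.put("copy", payload)  # a second copy adds no new chunks
    print(len(payload) * 2, "logical bytes,", store.stored_bytes(), "stored")
```

Because cut points depend on the bytes inside the rolling window and not on absolute offsets, inserting a few bytes near the start of a file usually shifts only the nearby chunks, and later chunks keep their fingerprints. The fingerprint lookup here is a plain dictionary; the abstract's concern about slower fingerprint retrieval at scale applies to the much larger index a real system would need.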
Keywords/Search Tags: Deduplication, HDFS, Hash Algorithm, Cloud Storage