
Research on Data Deduplication Based on Storage System

Posted on: 2017-04-27
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2308330488497132
Subject: Computer software and theory
Abstract/Summary:
The explosive growth of digital information poses great challenges to people's daily lives and to the business operations of enterprises. As the volume of redundant data grows, backup costs rise rapidly and storage system performance degrades sharply. It has therefore become important to study how to eliminate duplicate data objects efficiently and reduce storage costs. In recent years, data deduplication has been widely adopted in storage systems and has become one of the most active areas of computer applications. However, existing studies of deduplication storage systems lack in-depth work on accurately detecting similar data and on reducing the cost of storage device accesses, so system performance falls short of the required speed and deduplication ratio. To address these problems, this thesis studies the accurate detection of similar data and the optimization of storage system access, aiming at a more efficient technique for detecting similar data and at storage access with lower overhead.

To improve the accuracy of detecting approximately duplicated records in large-scale data deduplication, the thesis further studies deduplication based on the Simhash algorithm. Building on the existing algorithm, the improved algorithm refines the calculation process. By introducing ICTCLAS word segmentation, it produces more precise segments and tags each segment with its part of speech. Using TF-IDF as the main method for computing term weights curbs the negative effect of meaningless but high-frequency words in a document.
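As a rough illustration of the weighting idea, the following Python sketch builds a Simhash fingerprint from TF-IDF term weights and compares two fingerprints by their Hamming distance, which is the usual way Simhash signatures are compared. It is a minimal sketch under stated assumptions, not the thesis's implementation: documents are assumed to be already segmented into token lists (ICTCLAS is not called), the names `token_hash`, `tfidf_weights`, `simhash`, and `hamming_distance` are hypothetical, MD5 is an arbitrary choice of token hash, and the smoothed IDF formula is only one common variant.

```python
import hashlib
import math
from collections import Counter


def token_hash(token: str, bits: int = 64) -> int:
    """Hash a token to a `bits`-wide integer (MD5 is an assumed, arbitrary choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def tfidf_weights(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one document.

    `corpus` is a list of token lists. The smoothed IDF used here,
    log((1 + N) / (1 + df)) + 1, is an assumption for illustration.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def simhash(weights, bits: int = 64) -> int:
    """Build a fingerprint: each term votes +w or -w on every bit position."""
    acc = [0.0] * bits
    for term, w in weights.items():
        h = token_hash(term, bits)
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

In this sketch a term that occurs in every document receives a lower weight than a rarer term, so it contributes less to each bit of the fingerprint. Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold.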
Furthermore, part of speech and word length are introduced as additional weighting factors, and the Hamming distance between signatures is then compared to determine accurately whether two items are similar, which gives good results in identifying similar data.

To alleviate the frequent storage device accesses caused by the indexes used in data deduplication, the thesis analyzes how the Bloom Filter is currently applied in data deduplication and the existing problems of storage access performance, and proposes an efficient, optimized model based on the Bloom Filter. Because false positives are inherent to Bloom Filters, the thesis proposes using an additional Bloom Filter to lower the false positive rate and thereby reduce the number of storage system accesses. Because system software errors may cause false negatives in a Bloom Filter, the thesis proposes a single-bit error checking mechanism to prevent them, which also reduces memory overhead.

Finally, the thesis evaluates the improved Simhash algorithm and the improved Bloom Filter algorithm separately through simulation. The results show that the detection performance of the improved Simhash algorithm is superior to that of the Shingle algorithm and the original Simhash algorithm, and that it improves the accuracy of the signature value. By introducing a judgment mechanism with a complement Bloom Filter and a single-bit error checking mechanism, the optimized Bloom Filter achieves a lower false positive rate and lower storage access costs.
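To show why a Bloom Filter placed in front of a fingerprint index reduces storage accesses, here is a minimal Python sketch of a standard Bloom Filter guarding a deduplication index. It is an illustration under assumptions only: it does not reproduce the thesis's complement Bloom Filter or its single-bit error checking, the class names `BloomFilter` and `DedupIndex` are hypothetical, the filter size and number of hash functions are arbitrary, and a Python `set` stands in for the on-disk index.

```python
import hashlib


class BloomFilter:
    """Standard Bloom Filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions from one digest via double hashing.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class DedupIndex:
    """Hypothetical dedup index; `on_disk` is a stand-in for slow storage."""

    def __init__(self):
        self.filter = BloomFilter()
        self.on_disk = set()
        self.disk_lookups = 0

    def store_chunk(self, chunk: bytes) -> bool:
        """Return True if the chunk is new and stored, False if it is a duplicate."""
        fingerprint = hashlib.sha1(chunk).digest()
        # Only consult the slow index when the filter says "possibly present".
        if self.filter.might_contain(fingerprint):
            self.disk_lookups += 1
            if fingerprint in self.on_disk:
                return False
        self.filter.add(fingerprint)
        self.on_disk.add(fingerprint)
        return True
```

In this sketch a new chunk normally skips the slow lookup entirely, and a lookup is wasted only when the filter returns a false positive. That wasted access is the cost that lowering the false positive rate is meant to reduce.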
Keywords/Search Tags:Storage System, Data-deduplication, Similarity, Part Of Speech Weight, Simhash Algorithm, TF-IDF Technology, Bloom Filter, False Positive Rate, Error Checking