| The rapid development of the Internet has led to explosive growth in information. Today, the amount of information produced on the Internet reaches the exabyte (EB) scale every day, and storing and managing such vast amounts of data is a major challenge for individuals and companies. Although there are many types of storage systems and storage capacity keeps increasing, storing all of this information without screening is clearly not a wise choice. As an effective solution, data de-duplication technology has attracted attention. Currently, data de-duplication is widely used in backup systems, where the technology is mature. In online systems, however, it is used infrequently because of their special nature, especially their real-time requirements, so de-duplication there has more problems to solve. Pbfs is a middleware layer that can be applied to all major file systems, and it adopts a number of special solutions based on the characteristics of inline systems. First, Pbfs introduces the idea of document classification, processing each type of document in the way best suited to it. Second, Pbfs improves the similarity determination algorithm, and the new algorithm increases recognition accuracy. Finally, Pbfs makes the metadata dynamic; metadata is the most important part of a de-duplication system, and making it dynamic can improve the de-duplication ratio. The purpose of these solutions is to reduce the computational overhead of the system as far as possible while increasing the de-duplication ratio. Tests on various data sets show that Pbfs performs well compared with ZFS de-duplication and iDedup; the improvement in latency is especially pronounced, which is very attractive for online systems. 
Pbfs is built on ProSy, with ProSy's similarity determination algorithm and basic data structures optimized and improved. Compared with ProSy, Pbfs achieves a correspondingly higher de-duplication ratio and significantly better read and write throughput, achieving the desired results. |