
Research On The Optimization Of Storage Performance Of Massive Chinese Text Small Files In Ceph

Posted on: 2020-01-21 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Fan | Full Text: PDF
GTID: 2428330599459736 | Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, the mobile Internet, and EPC networks, massive numbers of small files are generated in many fields, such as e-commerce, social networking, mobile applications, online education, and scientific experiments. Small text files, the most typical kind of small file, have three features: large quantity, little data per file, and high redundancy. As a result, many distributed storage systems face three challenges when storing massive small files: difficult metadata management, too many disk I/O streams, and low file access efficiency. Today's most popular distributed file systems follow one of two designs: decentralized or centralized. Ceph, for example, is a decentralized file system designed for large-file storage. Although it avoids the storage-space bottleneck, it still suffers performance degradation when storing massive small files because of its double-write and backup strategies. The innovative work of this paper is summarized as follows:

1) To address the multiple I/O streams caused by storing massive small files, this paper presents SFPS (Small File Preprocess System), a preprocessing architecture for storing massive small files in Ceph. It adopts three techniques: clustering, data deduplication, and file combination. The method reduces the cost of storing massive files by merging similar files after deduplication with adaptive block skipping.

2) To improve the low file access efficiency of massive small-file storage, we first introduce the Redis database as a high-performance cache carrier. We then implement a prefetching mechanism based on the Hamming distance between files, and propose BME (Based on Multiple Elements), a multi-factor cache replacement optimization algorithm that adds file access interval, access frequency, and file size so that the value of a cached object is expressed more reasonably. Finally, a three-level cache structure and a dynamic cache elimination strategy are given to improve the file read rate.

3) A module is provided to delete and modify small files inside merged files. It gives users atomic operations on the merged small files while reducing space fragmentation in the large files.

This paper aims to balance data access time, storage space, bandwidth occupation, and cluster working efficiency when storing massive small files in Ceph in practical applications. Experimental results show that the proposed scheme not only improves the data read rate and reduces disk I/O streams, but also effectively reduces transmission bandwidth and storage space occupation. In addition, the multi-level dynamic elimination cache structure and the cache replacement optimization algorithm achieve a better cache hit ratio than the traditional cache replacement strategy.
Keywords/Search Tags: Ceph, small files, clustering, data deduplication, cache algorithms
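
The abstract names two cache-side ideas without giving formulas: prefetching driven by the Hamming distance between files, and a replacement policy that weighs access interval, access frequency, and file size. The Python sketch below is only an illustration of how such ideas might fit together. It is not the thesis's BME algorithm: the integer fingerprints, the `max_distance` threshold, and the scoring formula in `cache_value` are all assumptions introduced here, and a plain in-memory list stands in for Redis.

```python
from __future__ import annotations

from dataclasses import dataclass


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer fingerprints."""
    return bin(a ^ b).count("1")


def prefetch_candidates(
    target: str, fingerprints: dict[str, int], max_distance: int
) -> list[str]:
    """Return files whose fingerprint is within max_distance bits of the
    target's fingerprint, nearest first (ties broken by file name).

    Assumption: each file already has a fixed-width fingerprint; how the
    thesis derives it is not stated in the abstract.
    """
    base = fingerprints[target]
    scored = [
        (hamming_distance(base, fp), name)
        for name, fp in fingerprints.items()
        if name != target
    ]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


@dataclass
class CacheEntry:
    name: str
    size_bytes: int
    access_count: int
    last_access: float  # timestamp in seconds


def cache_value(entry: CacheEntry, now: float) -> float:
    """Hypothetical multi-factor score (NOT the thesis's formula):
    higher for objects that are accessed often, accessed recently, and small.
    """
    interval = max(now - entry.last_access, 1e-9)  # avoid division by zero
    return entry.access_count / (interval * entry.size_bytes)


def choose_victim(entries: list[CacheEntry], now: float) -> CacheEntry:
    """Pick the entry with the lowest score as the one to evict."""
    return min(entries, key=lambda e: cache_value(e, now))


if __name__ == "__main__":
    fingerprints = {
        "a.txt": 0b10110010,
        "b.txt": 0b10110011,
        "c.txt": 0b01001101,
        "d.txt": 0b10100010,
    }
    print(prefetch_candidates("a.txt", fingerprints, max_distance=2))

    entries = [
        CacheEntry("hot", 1024, 50, 99.0),
        CacheEntry("cold", 1024, 2, 10.0),
        CacheEntry("big", 1024 * 1024, 50, 99.0),
    ]
    print(choose_victim(entries, now=100.0).name)
```

Dividing frequency by interval and size is one simple way to combine the three factors; a weighted sum or a per-level threshold for a multi-level cache would be equally plausible readings of the abstract.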