
Optimization and Implementation of Small File Storage in HDFS under the Hadoop Platform

Posted on: 2020-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: Q Luo
GTID: 2428330590950376
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, big data technology has emerged in response to the explosive growth of information. To store such extremely large volumes of data, a variety of distributed file systems offer solutions. Hadoop is widely used in industry thanks to its high scalability and high reliability, and HDFS, the core storage component of Hadoop, provides the file storage service for big data processing. HDFS, however, is designed for large, streaming files, and performs poorly when storing large numbers of small files.

To address the low efficiency of HDFS when storing small files, this paper analyzes the Hadoop architecture and the HDFS file storage process in detail, and proposes a scheme that introduces a Multilevel Processing Module (MPM) for small files. The scheme first filters the files whose operation requests enter the system through a file preprocessing module: files under 4.35 MB are screened as small files and preliminarily classified by file extension. A file merge module then merges the preprocessed small files into as few large files as possible, reducing the memory load on the NameNode. To improve the query speed for small files, the scheme not only builds a secondary index module from each small file's creation time and extension, but also introduces a prefetch-and-cache module based on the files users commonly access. Finally, to counter the fragmentation caused by long-term operation of the system, a defragmentation module clears the blank space left in merged files once the system meets preset conditions, improving the utilization rate of system space.

This paper compares the proposed MPM scheme with three existing HDFS storage schemes: the native storage scheme, the HAR file archive scheme, and the SequenceFile scheme. When 100,000 files are stored, the MPM scheme saves 95.56% of the system's memory usage and achieves a space utilization of 99.92%. Under the same conditions, the write rate of the MPM scheme is twice that of the native storage scheme before optimization; because the merge mechanism adds extra steps, the write time is reduced by only 31%. The read rate improves by about 2.25 times, and the read time is the lowest among all schemes. Experimental results show that the MPM scheme significantly improves HDFS storage performance: it greatly reduces the number of files in the system, effectively lowers the NameNode memory load, improves system memory utilization, and achieves high small-file read and write rates.
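To make the pipeline concrete, the sketches below illustrate the individual modules in turn. They are minimal illustrations written against the public Hadoop FileSystem API, not the thesis's actual implementation; all class and method names (SmallFilePreprocessor, SmallFileMerger, and so on) are hypothetical. The first sketch shows the preprocessing step: screening files under the 4.35 MB threshold as small files and bucketing them by extension.

```java
import org.apache.hadoop.fs.FileStatus;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative preprocessing: screen small files and bucket them by extension. */
public class SmallFilePreprocessor {
    // Threshold taken from the abstract: files under 4.35 MB count as small.
    private static final long SMALL_FILE_THRESHOLD = (long) (4.35 * 1024 * 1024);

    /** Returns the small files grouped by extension; larger files are ignored here. */
    public static Map<String, List<FileStatus>> classify(List<FileStatus> files) {
        Map<String, List<FileStatus>> buckets = new HashMap<>();
        for (FileStatus f : files) {
            if (f.isFile() && f.getLen() < SMALL_FILE_THRESHOLD) {
                String name = f.getPath().getName();
                int dot = name.lastIndexOf('.');
                String ext = (dot >= 0) ? name.substring(dot + 1) : "noext";
                buckets.computeIfAbsent(ext, k -> new ArrayList<>()).add(f);
            }
        }
        return buckets;
    }
}
```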
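The merge module can be sketched in the same spirit: small files are appended into one large HDFS file while an index records each file's (offset, length) extent, which is what lets a single merged file stand in for many NameNode entries. The thesis's secondary index additionally keys on creation time and extension; here a plain map from original path to extent stands in for it.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative merge: append small files into one large file, indexing their extents. */
public class SmallFileMerger {
    /**
     * Appends each small file to a newly created merged file and returns an index
     * mapping the original path to its {offset, length} extent in the merged file.
     */
    public static Map<String, long[]> merge(FileSystem fs, List<FileStatus> smallFiles,
                                            Path mergedFile) throws IOException {
        Map<String, long[]> index = new HashMap<>();
        try (FSDataOutputStream out = fs.create(mergedFile)) {
            for (FileStatus f : smallFiles) {
                long offset = out.getPos();
                try (FSDataInputStream in = fs.open(f.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false); // copy without closing out
                }
                index.put(f.getPath().toString(), new long[]{offset, out.getPos() - offset});
            }
        }
        return index;
    }
}
```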
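For the prefetch-and-cache module, the abstract only states that caching is driven by the files users commonly access. A least-recently-used cache is one plausible realization; the sketch below uses java.util.LinkedHashMap's access-order mode for eviction. SmallFileCache and its capacity parameter are assumptions of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative client-side cache: keeps the bytes of recently read small files. */
public class SmallFileCache {
    private final int capacity;
    private final Map<String, byte[]> cache;

    public SmallFileCache(int capacity) {
        this.capacity = capacity;
        // An access-order LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > SmallFileCache.this.capacity;
            }
        };
    }

    public byte[] get(String path) { return cache.get(path); }

    public void put(String path, byte[] data) { cache.put(path, data); }
}
```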
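Finally, the defragmentation module reclaims the blank space that accumulates in merged files as small files are deleted. One way to realize it, sketched below under the same extent-index assumption as the merge sketch, is to copy only the still-referenced extents into a new file and rewrite the index; MergedFileCompactor is a hypothetical name.

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative defragmentation: keep only the live extents of a merged file. */
public class MergedFileCompactor {
    /**
     * Copies the extents still referenced in liveIndex from oldFile into newFile,
     * deletes oldFile, and returns the index rewritten for the compacted file.
     */
    public static Map<String, long[]> compact(FileSystem fs, Path oldFile, Path newFile,
                                              Map<String, long[]> liveIndex) throws IOException {
        Map<String, long[]> newIndex = new HashMap<>();
        try (FSDataInputStream in = fs.open(oldFile);
             FSDataOutputStream out = fs.create(newFile)) {
            for (Map.Entry<String, long[]> e : liveIndex.entrySet()) {
                long offset = e.getValue()[0];
                int length = (int) e.getValue()[1];
                byte[] buf = new byte[length];
                in.readFully(offset, buf);   // positioned read of one live extent
                long newOffset = out.getPos();
                out.write(buf);
                newIndex.put(e.getKey(), new long[]{newOffset, length});
            }
        }
        fs.delete(oldFile, false);           // reclaim the fragmented old file
        return newIndex;
    }
}
```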
Keywords/Search Tags: HDFS, massive small files, file merge, secondary index