Big data technology emerged with the explosive growth of information, and among big data platforms Hadoop is widely used in industry. In-depth study of its principles shows that, owing to its structural design and original purpose, Hadoop excels at processing large streaming files but is far less efficient at handling massive numbers of small files. Because of the design of HDFS, the metadata of massive small files consumes a large amount of NameNode memory, and differences in file size distort storage allocation across DataNodes, wasting resources. This thesis studies and optimizes Hadoop's inefficiency in storing and processing massive small files in order to improve its small-file handling capability.

To address this inefficiency, the thesis examines the architecture of Hadoop in depth, focusing on the underlying principles and workflow of the distributed storage system HDFS, and draws on the strengths of Hadoop's built-in mechanisms and the optimization schemes proposed by earlier researchers while analyzing their remaining shortcomings. On this basis, a multi-level processing model named HMPT (Hadoop multi-level processing template) is proposed. Compared with other optimization schemes, HMPT offers a relatively complete processing pipeline in which the method used by each module is individually optimized and improved. The model consists of five sub-modules: (1) a file judgment unit, which inspects uploaded files, filters small files into the subsequent processing modules, and uploads large files directly to the native system; (2) a text file classification module, which compensates for existing solutions' neglect of file relevance by first classifying files by type and then performing a second classification with the text-CNN algorithm; (3) a secondary file merging module, which applies a space-optimization-based secondary merging algorithm to merge and store small files, relieving pressure on the NameNode; (4) a file index module, which builds an efficient index with an HKD-tree to improve file-read efficiency; and (5) a Redis cluster cache module, which establishes a Redis-cluster-based caching mechanism that greatly improves the query efficiency of hot files.

Finally, multi-directional comparison experiments are carried out by uploading massive small files to a Hadoop cluster. NameNode memory usage, single-client read/write performance, multi-client concurrent read/write performance, and hot-file read/write performance are compared across the native file system, Hadoop's existing small-file mechanisms (HAR and SequenceFile), and the proposed HMPT model. The results verify that HMPT optimizes and improves the storage and processing of massive small files while preserving the functionality and operability of the original Hadoop platform.
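The file judgment unit described above can be illustrated with a minimal sketch. It assumes a size threshold of one HDFS block (128 MB) below which a file is treated as small; the class name, threshold, and queue are hypothetical, and only the Hadoop FileSystem calls are real API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of HMPT's file judgment unit. */
public class FileJudgmentUnit {
    // Assumed threshold: files below one HDFS block (128 MB) count as "small".
    private static final long SMALL_FILE_THRESHOLD = 128L * 1024 * 1024;

    private final FileSystem hdfs;
    private final List<File> smallFileQueue = new ArrayList<>();

    public FileJudgmentUnit(Configuration conf) throws Exception {
        this.hdfs = FileSystem.get(conf);
    }

    /** Route an uploaded local file by size. */
    public void judge(File local, String hdfsDir) throws Exception {
        if (local.length() < SMALL_FILE_THRESHOLD) {
            // Small file: hand off to the classification and merging modules.
            smallFileQueue.add(local);
        } else {
            // Large file: upload directly to native HDFS.
            hdfs.copyFromLocalFile(new Path(local.getAbsolutePath()),
                                   new Path(hdfsDir, local.getName()));
        }
    }

    public List<File> pendingSmallFiles() {
        return smallFileQueue;
    }
}
```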
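The space-optimization idea behind the merging module can be sketched in generic form. The thesis's exact secondary merging algorithm is not specified in the abstract; the sketch below only shows a first-fit-decreasing packing of small files into block-sized bins, the kind of packing that reduces NameNode metadata by producing one merged file per bin.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Generic first-fit-decreasing packing; a stand-in for the merging step. */
public class SmallFilePacker {
    private static final long BIN_CAPACITY = 128L * 1024 * 1024; // one HDFS block

    public static List<List<File>> pack(List<File> smallFiles) {
        List<File> sorted = new ArrayList<>(smallFiles);
        sorted.sort(Comparator.comparingLong(File::length).reversed());

        List<List<File>> bins = new ArrayList<>();
        List<Long> free = new ArrayList<>();   // remaining space per bin
        for (File f : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++) {
                if (free.get(i) >= f.length()) { target = i; break; } // first fit
            }
            if (target < 0) {                  // open a new block-sized bin
                bins.add(new ArrayList<>());
                free.add(BIN_CAPACITY);
                target = bins.size() - 1;
            }
            bins.get(target).add(f);
            free.set(target, free.get(target) - f.length());
        }
        return bins;                           // each bin becomes one merged file
    }
}
```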
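The hot-file caching mechanism can likewise be sketched as a cache-aside lookup. This assumes the Jedis client library and an assumed one-hour TTL (the abstract names neither), and the readFromHdfs helper is a hypothetical placeholder: on a cache hit the content comes straight from Redis, and on a miss it is read from HDFS and written back to the cluster.

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.Collections;

/** Hypothetical cache-aside lookup for hot files backed by a Redis cluster. */
public class HotFileCache {
    private final JedisCluster redis;

    public HotFileCache(String host, int port) {
        this.redis = new JedisCluster(
                Collections.singleton(new HostAndPort(host, port)));
    }

    /** Return file content, consulting Redis before HDFS. */
    public String read(String hdfsPath) throws Exception {
        String cached = redis.get(hdfsPath);     // cache hit: skip HDFS entirely
        if (cached != null) {
            return cached;
        }
        String content = readFromHdfs(hdfsPath); // cache miss: go to HDFS
        redis.setex(hdfsPath, 3600, content);    // cache for an hour (assumed TTL)
        return content;
    }

    // Placeholder for an HDFS read; a real module would use FSDataInputStream.
    private String readFromHdfs(String path) throws Exception {
        throw new UnsupportedOperationException("HDFS read not shown");
    }
}
```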