Big data technology emerged with the explosive growth of information, and among big data platforms Hadoop is widely used in industry. In-depth study of its principles shows that, owing to its structural design and original purpose, Hadoop excels at processing large streaming files but is far less efficient at handling massive numbers of small files. Because of the design of HDFS, the metadata of massive small files consumes a large amount of NameNode memory, and differences in file size distort storage allocation across DataNodes, wasting resources. This thesis studies and optimizes Hadoop's inefficiency in storing and processing massive small files in order to improve its small-file handling capability.

To address this inefficiency, the thesis examines the architecture of Hadoop in depth, focusing on the underlying principles and workflow of the distributed storage system HDFS, and draws on the strengths of Hadoop's built-in mechanisms and the optimization schemes proposed by earlier researchers while analyzing their remaining shortcomings. On this basis, a multi-level processing model named HMPT (Hadoop multi-level processing template) is proposed. Compared with other optimization schemes, HMPT offers a relatively complete processing pipeline in which the method used by each module is individually optimized and improved. The model consists of five sub-modules: (1) a file judgment unit, which inspects uploaded files, filters small files into the subsequent processing modules, and uploads large files directly to the native system; (2) a text file classification module, which compensates for existing solutions' neglect of file relevance by first classifying files by type and then performing a second classification with the text-CNN algorithm; (3) a secondary file merging module, which applies a space-optimization-based secondary merging algorithm to merge and store small files, relieving pressure on the NameNode; (4) a file index module, which builds an efficient index with an HKD-tree to improve file-read efficiency; and (5) a Redis cluster cache module, which establishes a Redis-cluster-based caching mechanism that greatly improves the query efficiency of hot files.

Finally, multi-directional comparison experiments are carried out by uploading massive small files to a Hadoop cluster. NameNode memory usage, single-client read/write performance, multi-client concurrent read/write performance, and hot-file read/write performance are compared across the native file system, Hadoop's existing small-file mechanisms (HAR and SequenceFile), and the proposed HMPT model. The results verify that HMPT optimizes and improves the storage and processing of massive small files while preserving the functionality and operability of the original Hadoop platform.
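The file judgment unit described above can be illustrated with a minimal sketch. It assumes a size threshold of one HDFS block (128 MB) below which a file is treated as small; the class name, threshold, and queue are hypothetical, and only the Hadoop FileSystem calls are real API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of HMPT's file judgment unit. */
public class FileJudgmentUnit {
    // Assumed threshold: files below one HDFS block (128 MB) count as "small".
    private static final long SMALL_FILE_THRESHOLD = 128L * 1024 * 1024;

    private final FileSystem hdfs;
    private final List<File> smallFileQueue = new ArrayList<>();

    public FileJudgmentUnit(Configuration conf) throws Exception {
        this.hdfs = FileSystem.get(conf);
    }

    /** Route an uploaded local file by size. */
    public void judge(File local, String hdfsDir) throws Exception {
        if (local.length() < SMALL_FILE_THRESHOLD) {
            // Small file: hand off to the classification and merging modules.
            smallFileQueue.add(local);
        } else {
            // Large file: upload directly to native HDFS.
            hdfs.copyFromLocalFile(new Path(local.getAbsolutePath()),
                                   new Path(hdfsDir, local.getName()));
        }
    }

    public List<File> pendingSmallFiles() {
        return smallFileQueue;
    }
}
```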
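The space-optimization idea behind the merging module can be sketched in generic form. The thesis's exact secondary merging algorithm is not specified in the abstract; the sketch below only shows a first-fit-decreasing packing of small files into block-sized bins, the kind of packing that reduces NameNode metadata by producing one merged file per bin.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Generic first-fit-decreasing packing; a stand-in for the merging step. */
public class SmallFilePacker {
    private static final long BIN_CAPACITY = 128L * 1024 * 1024; // one HDFS block

    public static List<List<File>> pack(List<File> smallFiles) {
        List<File> sorted = new ArrayList<>(smallFiles);
        sorted.sort(Comparator.comparingLong(File::length).reversed());

        List<List<File>> bins = new ArrayList<>();
        List<Long> free = new ArrayList<>();   // remaining space per bin
        for (File f : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++) {
                if (free.get(i) >= f.length()) { target = i; break; } // first fit
            }
            if (target < 0) {                  // open a new block-sized bin
                bins.add(new ArrayList<>());
                free.add(BIN_CAPACITY);
                target = bins.size() - 1;
            }
            bins.get(target).add(f);
            free.set(target, free.get(target) - f.length());
        }
        return bins;                           // each bin becomes one merged file
    }
}
```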
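The hot-file caching mechanism can likewise be sketched as a cache-aside lookup. This assumes the Jedis client library and an assumed one-hour TTL (the abstract names neither), and the readFromHdfs helper is a hypothetical placeholder: on a cache hit the content comes straight from Redis, and on a miss it is read from HDFS and written back to the cluster.

```java
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.Collections;

/** Hypothetical cache-aside lookup for hot files backed by a Redis cluster. */
public class HotFileCache {
    private final JedisCluster redis;

    public HotFileCache(String host, int port) {
        this.redis = new JedisCluster(
                Collections.singleton(new HostAndPort(host, port)));
    }

    /** Return file content, consulting Redis before HDFS. */
    public String read(String hdfsPath) throws Exception {
        String cached = redis.get(hdfsPath);     // cache hit: skip HDFS entirely
        if (cached != null) {
            return cached;
        }
        String content = readFromHdfs(hdfsPath); // cache miss: go to HDFS
        redis.setex(hdfsPath, 3600, content);    // cache for an hour (assumed TTL)
        return content;
    }

    // Placeholder for an HDFS read; a real module would use FSDataInputStream.
    private String readFromHdfs(String path) throws Exception {
        throw new UnsupportedOperationException("HDFS read not shown");
    }
}
```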