Research On Access Optimization Of Small Files In Hadoop Cluster

Posted on:2020-06-19

Degree:Master

Type:Thesis

Country:China

Candidate:Z Ma

Full Text:PDF

GTID:2428330590454677

Subject:Control Science and Engineering

Abstract/Summary:

In recent years, big data has entered a stage of accelerated development worldwide. The total amount of data increases by 50% annually, showing a trend of massive aggregation and explosive growth and bringing new characteristics of transformation. How to store, analyze and utilize these data is a major problem that urgently needs to be solved. Hadoop, which is composed of HDFS, MapReduce, Hive, HBase and other components, has gradually become a general-purpose big data storage platform because of its excellent performance, its stable and secure ecosystem and its open source advantages. However, because the fixed memory of the NameNode limits the number of files it can track, HDFS is not well suited to storing large numbers of small files. Therefore, after reviewing the relevant literature and analyzing how HDFS accesses files and how HBase accesses data, this paper proposes an access optimization scheme for large numbers of small files. The following research has been done.

In practice there are many files with similar file names. Storing them directly in HBase leads to access hotspots, and accessing a large number of files also leads to frequent interaction between HDFS clients and the primary node. This paper proposes a row key optimization method based on data sets and a classified reading strategy, and implements a cache prefetching strategy based on Ehcache. Firstly, before the next group of files is merged and stored, file indexes are established for large files, merged files and small files, and the row key stored in the HBase database is built by splicing together an MD5 value, the file name and the data set name, so that the files are evenly distributed across the nodes and the system load is balanced. Secondly, large files, merged files and small files are marked separately. Then the Ehcache framework is used as the cache module to prefetch hot files and cache metadata, which further improves the file access speed of the Hadoop cluster and reduces the number of interactions between the HDFS client and the primary node. Finally, the experimental results show that the method significantly improves reading speed.

On the basis of the above research, an HDFS file management demonstration system is designed and implemented. According to user needs in actual system development, the demo system provides two modes, ordinary user and administrator, and implements file management and user information management. Finally, a performance test of the demo system shows that it can efficiently manage massive numbers of files in HDFS.
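
The row key construction described in the abstract (splicing an MD5 value, the file name and the data set name) can be illustrated with a minimal Python sketch. The abstract does not say what is hashed, how much of the digest is kept, or which separator is used, so the choices here are assumptions: the MD5 is taken over the file name, the first 8 hex characters are kept, and the parts are joined with an underscore. The function names `make_row_key` and `region_of`, and the idea of four regions split on the first hex digit, are hypothetical and only serve to show why a hashed prefix spreads similarly named files.

```python
import hashlib
from collections import Counter


def make_row_key(file_name: str, dataset_name: str, prefix_len: int = 8) -> str:
    """Build a row key as <md5 prefix>_<file name>_<data set name>.

    Assumption: the MD5 is computed over the file name only, and only the
    first `prefix_len` hex characters are kept.
    """
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{file_name}_{dataset_name}"


def region_of(row_key: str, num_regions: int = 4) -> int:
    """Map a row key to a hypothetical region using its first hex digit."""
    return int(row_key[0], 16) * num_regions // 16


# File names that differ only in a trailing counter, as in the abstract's
# "files with similar file names" scenario.
names = [f"img_{i:04d}.jpg" for i in range(1000)]

# Used directly as keys, all names share one leading prefix and would sort
# into a single contiguous key range.
plain_prefixes = {name[:4] for name in names}

# With the hashed prefix, the same names fall into different regions.
row_keys = [make_row_key(name, "photos") for name in names]
region_counts = Counter(region_of(key) for key in row_keys)

print("distinct plain prefixes:", len(plain_prefixes))
print("files per hypothetical region:", dict(sorted(region_counts.items())))
```

Because the original file name and data set name remain in the key after the hash prefix, a client that knows both can recompute the full key and issue a direct lookup rather than a scan.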

Keywords/Search Tags:

Hadoop, small files, HDFS, HBase, prefetch and cache

Related items

1	Optimization Study On Storing Massive Small Files Based On Hadoop
2	Design And Implementation Of Cloud Storage System Based On Hadoop
3	The Research Of HDFS Optimization Towards Lots Of Small Files Accessing And Storage
4	Research And Design Of Massive Small Files Merging Based On Hadoop
5	Processing Of Small Files Based On HDFS And Optimization And Improvement Of The Performance For Mapreduce Computing Model
6	Design And Implementation Of Disk Cache System Based On HDFS Optical Jukebox
7	Research On Storage Strategy Of Massive Small Files Based On HDFS
8	Research And Optimization Of Small Files Processing Techniques In Hadoop
9	The Research On Storage Of Massive Small Air Cargo Files Based On Hadoop
10	Research And Optimization Of Mass Small Files Based On HDFS