Research On Access Optimization Of Small Files In Hadoop Cluster

Posted on: 2020-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: Z Ma
Full Text: PDF
GTID: 2428330590454677
Subject: Control Science and Engineering
Abstract/Summary:
In recent years, big data has entered a stage of accelerated development worldwide. The total volume of data grows by 50% annually, showing massive aggregation and explosive growth and driving new kinds of transformation. How to store, analyze, and use this data is an urgent problem. Hadoop, which comprises HDFS, MapReduce, Hive, HBase, and other components, has gradually become a general-purpose big data storage platform thanks to its strong performance, stable and secure ecosystem, and open-source availability. However, because a fixed amount of NameNode memory limits the number of files that can be tracked, HDFS is poorly suited to storing large numbers of small files. After reviewing the relevant literature and analyzing how HDFS accesses files and how HBase accesses data, this thesis proposes an access optimization scheme for large numbers of small files.
In practice, many files have similar names. Storing them directly in HBase creates access hotspots, and accessing large numbers of files causes frequent interaction between the HDFS client and the NameNode. To address both problems, the thesis proposes a data-set-based row key optimization method and a classified reading strategy, and implements a cache prefetching strategy based on Ehcache. First, before a group of files is merged and stored, file indexes are built for large files, merged files, and small files, and each index is stored in HBase under a row key formed by concatenating an MD5 value, the file name, and the data set name, so that files are distributed evenly across nodes and the system load is balanced. Second, large files, merged files, and small files are marked separately so that each class can be read accordingly. Third, the Ehcache framework is used as the cache module to prefetch hot files and cache metadata, which further improves file access speed in the Hadoop cluster and reduces the number of interactions between the HDFS client and the NameNode. Experimental results show that the method significantly improves read speed.
On the basis of this research, an HDFS file management demonstration system is designed and implemented. Following user requirements gathered during development, the demo system provides two modes, ordinary user and administrator, and implements file management and user information management. Performance tests of the demo system show that it can efficiently manage massive numbers of files in HDFS.
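The row key design mentioned in the abstract can be illustrated with a minimal sketch. The abstract only says that the key combines an MD5 value, the file name, and the data set name; it does not give the field order, what is hashed, the prefix length, or the separator. The sketch below assumes the MD5 is taken over the file name and placed first, and the function name `build_row_key` and parameter `prefix_len` are hypothetical, so it should be read as one plausible layout rather than the thesis's actual implementation.

```python
import hashlib


def build_row_key(dataset_name: str, file_name: str, prefix_len: int = 8) -> str:
    """Return an illustrative row key: <md5 prefix>_<data set>_<file name>.

    Assumptions (not stated in the abstract): the MD5 is computed over the
    file name, only a hex prefix of it is kept, and "_" is the separator.
    """
    # Hashing the file name gives near-identical names unrelated prefixes,
    # so lexicographically sorted rows spread out instead of clustering.
    digest = hashlib.md5(file_name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{dataset_name}_{file_name}"


if __name__ == "__main__":
    # Similar file names end up with different leading characters.
    for name in ("img_0001.jpg", "img_0002.jpg", "img_0003.jpg"):
        print(build_row_key("photos", name))
```

Placing the hash prefix first is a common way to avoid hotspots with sequential or similar names, because HBase stores rows sorted by key; the trade-off is that range scans over a data set no longer map to one contiguous key range.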
Keywords/Search Tags: Hadoop, small files, HDFS, HBase, prefetch and cache