
Optimization Of Small Files Accessed Based On MapFile In HDFS

Posted on: 2018-05-30
Degree: Master
Type: Thesis
Country: China
Candidate: X L Hong
Full Text: PDF
GTID: 2348330518469913
Subject: Computer Science and Technology
Abstract/Summary:
With the rise of social networking and the rapid development of the Internet of Things, data of all forms is growing and accumulating explosively. Apache Hadoop has become a driving force of the big data industry and the first choice of many companies. Hadoop is a computing architecture for managing distributed data in parallel; it processes large data sets inexpensively, and its high fault tolerance and scalability have made it increasingly popular. The Hadoop Distributed File System (HDFS) is one of Hadoop's core components. HDFS uses a master/slave structure: the system has a single master node (the NameNode) and may have many slave nodes (DataNodes), which makes it well suited to accessing large files. However, this structure also has drawbacks. Accessing massive numbers of small files generates a large amount of metadata, and the heartbeat mechanism that maintains this information puts great pressure on the NameNode, lowering access efficiency and making it one of Hadoop's performance bottlenecks. In modern and future cloud computing, small files are the dominant form of data, and efficient access to massive numbers of small files has become a problem that many Internet companies compete to solve.

Because HDFS is inefficient at accessing massive numbers of small files, Hadoop itself provides Hadoop Archives (HAR), SequenceFile, and similar mechanisms. These schemes effectively reduce the amount of metadata on the master node and to some extent improve small-file access efficiency. However, because HDFS stores metadata and indexes without ordering, reading a file may consume additional resources, and accessing a file that does not exist in the system requires traversing the entire fsimage, which wastes the NameNode's limited resources. In view of this, and building on the strategies of other researchers, this paper proposes an
optimization of small file access based on MapFile in HDFS. When storing small files, the client classifies them by file type and access permission, merges small files of the same type and the same permission into a MapFile, and hands the resulting large file to HDFS. Given the impact of cache technology on data hit rate, we introduce a cache module composed of a Nexist (non-existent) file buffer region and a multi-level cache, which effectively improves the file hit rate, avoids frequent access to the NameNode, and reduces its load.

On the basis of this analysis, three system environments are built: traditional HDFS, HDFS based on MapFile, and the optimized HDFS; the NameNode memory consumption and access efficiency of each are tested and analyzed. Experiments show that this method effectively reduces NameNode memory consumption when accessing massive numbers of small files and shortens small-file fetch time. In addition, it solves the problem of the NameNode traversing all indexes when a requested file does not exist in the system, optimizing the access efficiency of the system as a whole.
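The core idea of the MapFile-based merging step can be sketched as follows. This is an illustrative, self-contained stand-in (not the Hadoop `org.apache.hadoop.io.MapFile` API itself): small files sharing a type and permission label are appended in sorted key order to one data blob with a sorted name-to-offset index, so the NameNode would only hold metadata for the single merged file. The class and field names are our own, chosen for illustration.

```python
import bisect

class MapFilePack:
    """Illustrative stand-in for a MapFile: many small files merged into
    one data blob plus a sorted key -> offset index, so the NameNode
    stores metadata for one large file instead of thousands."""

    def __init__(self, file_type, permission):
        # As proposed, files are grouped by (type, permission) before
        # packing; both fields are plain labels in this sketch.
        self.file_type = file_type
        self.permission = permission
        self.keys = []      # sorted small-file names (MapFile keys)
        self.offsets = []   # (start, length) pairs into the data blob
        self.data = bytearray()

    def append(self, name, content):
        # A MapFile requires keys to be written in sorted order.
        if self.keys and name <= self.keys[-1]:
            raise ValueError("keys must be appended in sorted order")
        self.keys.append(name)
        self.offsets.append((len(self.data), len(content)))
        self.data.extend(content)

    def get(self, name):
        # Binary search over the sorted index, as a MapFile reader does,
        # instead of scanning the whole container.
        i = bisect.bisect_left(self.keys, name)
        if i < len(self.keys) and self.keys[i] == name:
            start, length = self.offsets[i]
            return bytes(self.data[start:start + length])
        return None
```

A client would create one `MapFilePack("jpg", "rw-r--r--")` per (type, permission) group, append each small file, and store the packed blob and index in HDFS as a single large file.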
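The cache module's lookup flow can likewise be sketched in miniature. Under our reading of the abstract, the Nexist buffer remembers names already known to be absent, so repeated requests for missing files never reach the NameNode, while a two-level LRU cache serves repeated reads. All names here (`NexistAwareCache`, `backend`, the two-level split) are illustrative assumptions, not the thesis implementation.

```python
from collections import OrderedDict

class NexistAwareCache:
    """Sketch of the proposed client-side cache: a Nexist buffer for
    names known NOT to exist, plus a two-level LRU cache, so repeated
    reads and repeated misses avoid touching the NameNode."""

    def __init__(self, backend, l1_size=2, l2_size=8):
        self.backend = backend        # stands in for an HDFS/NameNode lookup
        self.nexist = set()           # names already known to be absent
        self.l1 = OrderedDict()       # small, hot first-level cache
        self.l2 = OrderedDict()       # larger second-level cache
        self.l1_size, self.l2_size = l1_size, l2_size
        self.backend_hits = 0         # how often we had to ask the backend

    def _put(self, cache, size, name, value):
        cache[name] = value
        cache.move_to_end(name)
        if len(cache) > size:
            # Demote/evict the least recently used entry.
            old_name, old_value = cache.popitem(last=False)
            if cache is self.l1:
                self._put(self.l2, self.l2_size, old_name, old_value)

    def read(self, name):
        if name in self.nexist:       # known-missing: answer immediately
            return None
        if name in self.l1:
            self.l1.move_to_end(name)
            return self.l1[name]
        if name in self.l2:           # promote a second-level hit into L1
            value = self.l2.pop(name)
            self._put(self.l1, self.l1_size, name, value)
            return value
        self.backend_hits += 1        # only now touch the "NameNode"
        value = self.backend.get(name)
        if value is None:
            self.nexist.add(name)     # remember the miss
        else:
            self._put(self.l1, self.l1_size, name, value)
        return value
```

The key effect is visible in `backend_hits`: a second read of an existing file is served from cache, and a second read of a missing file is answered from the Nexist buffer, in both cases without another NameNode round trip.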
Keywords/Search Tags: HDFS, MapFile, small files, access, cache