
Research On File Accessing Performance Optimization Based On HDFS

Posted on: 2016-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: L J Zhu
Full Text: PDF
GTID: 2348330536467358
Subject: Computer Science and Technology
Abstract/Summary:
Hadoop has been widely used in cloud computing and big data applications as an excellent fundamental cloud infrastructure platform. Its distributed file system, HDFS, serves as the underlying base for storing huge amounts of data. With the rapid growth in the volume and variety of data, it is essential to optimize HDFS so that it satisfies both high performance for file accessing and high reliability for storage. This paper concerns file accessing performance optimization based on the Hadoop cluster structure and the storage management mechanism of HDFS, with an in-depth study concentrating on the small file problem and on dynamic replication optimization. Before the design and implementation, the structure of the Hadoop storage system and of HDFS is studied, and existing research on file accessing performance and on HDFS replication policies is expounded, which lays the foundation of this paper. The contributions of this paper are as follows.

First, to solve the HDFS small file problem, its causes are studied both through a theoretical formulation and through tests. A prefetching method based on historical traces is proposed, which works in two parts: metadata pushing and block prefetching (a minimal sketch of the mechanism follows this abstract). The NameNode is responsible for pushing metadata to a cache implemented on the node where the DFSClient is located. The historical trace describes the file accessing history of HDFS; it reveals the repetitiveness and temporal locality of small file accesses and therefore serves as an effective basis for metadata pushing and file prefetching. Block prefetching is performed asynchronously in the background by the DataNode after it receives the metadata cache content. The metadata cache on the DataNode then provides faster responses to subsequent read requests: once the metadata of a small file is cached, the prefetcher is invoked to fetch its blocks. Experimental results show that the prefetching method effectively reduces the latency of accessing large numbers of small files.

Second, to handle the dynamic update of redundant replicas, a growth and degradation algorithm for dynamic replication based on the metadata cache is proposed, which decreases the storage resources consumed by redundant replicas (also sketched below). The metadata cache of the DataNode records a popularity value for every file accessed. Once the popularity satisfies the growth or degradation condition, the replica count on the local node changes by adding or removing one copy. After the replica count changes, the local node informs the other nodes, including the NameNode and the DataNodes, so that they can update their metadata. A multicast tree is built according to topological distance from the local node, and the updated metadata is delivered hierarchically through the cluster, which decreases the load on the master and speeds up the propagation of update messages. Experimental results demonstrate the effectiveness of the method: it decreases the storage occupied by cold data while at the same time improving the accessing performance of hot data, which benefits overall file accessing performance.

The research in this paper is a useful exploration of optimizing small file accessing and of dynamic replication with updating policies. The algorithms and methods proposed have theoretical value and practical significance for improving the performance of Hadoop, and especially of HDFS, in the face of big data storage and processing.
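The following is a minimal sketch, in Java, of the client-side metadata cache with asynchronous block prefetching described above. All names here (PrefetchCache, FileMeta, push, fetchBlock) are hypothetical illustrations rather than actual HDFS APIs, and the LRU eviction policy is an assumption not specified by the abstract.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** Hypothetical sketch: a metadata cache that triggers background
     *  block prefetching when the NameNode pushes an entry to it. */
    public class PrefetchCache {
        /** Hypothetical metadata record for one small file. */
        record FileMeta(String path, long blockId, String dataNode) {}

        private final int capacity;
        // Access-ordered LinkedHashMap acts as an LRU cache (an assumption).
        private final Map<String, FileMeta> cache;
        private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();

        public PrefetchCache(int capacity) {
            this.capacity = capacity;
            this.cache = new LinkedHashMap<>(16, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, FileMeta> e) {
                    return size() > PrefetchCache.this.capacity;
                }
            };
        }

        /** Called when metadata predicted from the historical access trace
         *  is pushed; caching an entry triggers background prefetching. */
        public synchronized void push(FileMeta meta) {
            cache.put(meta.path(), meta);
            prefetcher.submit(() -> fetchBlock(meta)); // asynchronous, off the read path
        }

        /** A subsequent read hits the cache and skips a NameNode round trip. */
        public synchronized FileMeta lookup(String path) {
            return cache.get(path);
        }

        private void fetchBlock(FileMeta meta) {
            // Placeholder: a real implementation would read the block from
            // meta.dataNode() into a local buffer.
            System.out.printf("prefetching block %d of %s from %s%n",
                    meta.blockId(), meta.path(), meta.dataNode());
        }
    }

The key point of the design is that push() returns immediately: prefetching happens on a separate thread, so pushing metadata never blocks the read path it is meant to accelerate.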
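The growth and degradation rule can likewise be sketched as a periodic sweep over per-file popularity counters. The thresholds, the decay factor, and the default replica count of three are illustrative assumptions; the abstract does not give concrete values.

    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical sketch of the popularity-driven growth/degradation rule:
     *  add one replica when popularity crosses an upper threshold, remove
     *  one when it falls below a lower threshold. */
    public class ReplicaAdjuster {
        private static final double GROW_THRESHOLD = 10.0;  // assumed value
        private static final double SHRINK_THRESHOLD = 1.0; // assumed value
        private static final double DECAY = 0.5;            // ages old accesses each sweep
        private static final int MIN_REPLICAS = 1;
        private static final int MAX_REPLICAS = 10;

        private final Map<String, Double> popularity = new HashMap<>();
        private final Map<String, Integer> replicas = new HashMap<>();

        /** Record one access: popularity is modeled as a decayed access count. */
        public void onAccess(String path) {
            popularity.merge(path, 1.0, Double::sum);
        }

        /** Periodic sweep: decay each popularity value, then apply the +1/-1 rule. */
        public void sweep() {
            for (Map.Entry<String, Double> e : popularity.entrySet()) {
                String path = e.getKey();
                double p = e.getValue() * DECAY;
                e.setValue(p);
                int r = replicas.getOrDefault(path, 3); // 3 is the HDFS default replication
                if (p > GROW_THRESHOLD && r < MAX_REPLICAS) {
                    replicas.put(path, r + 1);          // growth: an extra copy for hot data
                } else if (p < SHRINK_THRESHOLD && r > MIN_REPLICAS) {
                    replicas.put(path, r - 1);          // degradation: reclaim a cold copy
                }
            }
        }

        public int replicaCount(String path) {
            return replicas.getOrDefault(path, 3);
        }
    }

Decaying the counter on every sweep is what lets a cold file drift below the degradation threshold even after a burst of early accesses.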
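Finally, a sketch of the hierarchical metadata update: peers are ordered by topological distance from the node whose replica count changed, and the update fans out along a k-ary tree instead of the originating node (or the NameNode) contacting every peer directly. The fan-out of three and the integer distance metric are assumptions.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    /** Hypothetical sketch of multicast-tree delivery of replica-count updates. */
    public class MulticastTree {
        record Node(String id, int distanceFromRoot) {}

        private static final int FANOUT = 3; // each node forwards to at most 3 others

        public static void propagate(Node root, List<Node> peers) {
            List<Node> ordered = new ArrayList<>(peers);
            ordered.sort(Comparator.comparingInt(Node::distanceFromRoot)); // nearest first
            // The root forwards to the FANOUT nearest peers; the peer at index i
            // then forwards to indices FANOUT*(i+1) .. FANOUT*(i+2)-1, forming a
            // k-ary tree in which every peer receives the update exactly once.
            for (int c = 0; c < FANOUT && c < ordered.size(); c++) {
                send(root, ordered.get(c));
            }
            for (int i = 0; i < ordered.size(); i++) {
                for (int c = FANOUT * (i + 1); c < FANOUT * (i + 2) && c < ordered.size(); c++) {
                    send(ordered.get(i), ordered.get(c));
                }
            }
        }

        private static void send(Node from, Node to) {
            System.out.printf("%s -> %s: replica-count update%n", from.id(), to.id());
        }
    }

Because each hop sends at most FANOUT messages, the update spreads in a logarithmic number of rounds, and the load of notifying the cluster is shared among the nodes rather than concentrated on the master.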
Keywords/Search Tags: distributed storage system, HDFS, small file problem, redundant replication, metadata multicast