Multiple HDFS Unified Namespace Management And Performance Optimization For Alluxio Data Access

Posted on: 2019-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Z Huang
Full Text: PDF
GTID: 2428330545976778
Subject: Computer Science and Technology
Abstract/Summary:
With the development of computer and information technology, traditional single-machine systems can no longer handle the ever-increasing volume of data, and big data technology has emerged in response. Distributed file systems play an important role in the big data ecosystem. HDFS (Hadoop Distributed File System) has become a widely used big data storage system because of its reliability and good scalability. HDFS adopts a typical master-slave architecture in which the metadata is stored in the memory of the HDFS NameNode, which limits metadata capacity. The common scale-out solution is to increase the number of NameNodes in the HDFS cluster. Each NameNode manages its own namespace, so the cluster ends up with multiple HDFS namespaces. A multiple HDFS namespace management solution provides a unified namespace for upper-layer applications, which relieves them of the burden of using multiple HDFS namespaces and eliminates the single-point bottleneck caused by a single NameNode.

Existing multiple HDFS namespace management solutions provide a unified namespace but suffer from problems such as complex management and difficulty of use. Taking ViewFS as an example, when an HDFS namespace managed by ViewFS changes, all upper-layer applications must modify their configuration, which is inconvenient. There is currently no multiple HDFS namespace management solution that takes both ease of use and metadata access performance into account.

In addition to the complexity of multiple namespace management, HDFS suffers from poor data access performance due to its disk-based architecture. As the memory available in servers grows, distributed in-memory file systems have emerged to improve the data access performance of upper-layer applications. Alluxio is a widely used distributed in-memory file system. However, Alluxio uses a large data block size, while ramfs, the underlying single-machine in-memory file system used by Alluxio, uses a small data block size. This incurs many page faults during reads and seriously affects the efficiency of in-memory data reading.

To solve the above problems, the research in this thesis centers on distributed file systems and covers multiple HDFS namespace management and read performance optimization of the distributed in-memory file system Alluxio. The primary contributions are as follows.

(1) A multiple HDFS namespace management method. We analyze the existing multiple HDFS namespace management methods and summarize their problems and disadvantages in terms of ease of use and metadata access performance. We then propose Alluxio ProxyFS, a multiple HDFS namespace management method based on the Alluxio distributed file system that takes both ease of use and metadata access performance into account. To address the low data access efficiency of HDFS, we implement a cache layer for multiple HDFS based on Alluxio. It provides two ways to promote data into Alluxio, setting a cache directory and configuring a cache quota, which improves the data access performance of upper-layer applications.

(2) Read performance optimization of the distributed in-memory file system Alluxio. We analyze the in-memory file system ramfs and the mmap read method used by Alluxio and propose two methods to improve Alluxio read performance. For scenarios that read the same file region multiple times, we add a reference count to each file region and add a cache queue that defers the munmap system call that frees a file region read by the Alluxio client, which reduces page faults when the same file region is read again. For general Alluxio read scenarios, we replace ramfs with the in-memory file system hugetlbfs as the in-memory storage for Alluxio, which reduces page faults while the Alluxio client reads files and improves Alluxio's overall read performance.

(3) The HDFS and Alluxio systems optimized with the above key technologies and methods have been tested for compatibility. Existing big data systems such as Hadoop, Spark, HBase, Hive, Flume, Sqoop and Druid run smoothly on them. Some of the compatibility testing results have been submitted to the open source community and have started to be used in industry.

The experimental results show that the metadata access performance of Alluxio ProxyFS improves by about 60% compared to the existing multiple HDFS namespace management solutions, and that the data access performance of iterative applications improves by about 8% when running on Alluxio ProxyFS. Alluxio packet-level caching improves the performance of multiple threads reading the same file by about 20% and improves the performance of iteratively reading the same file by nearly 100%. Alluxio-on-hugetlbfs improves the read performance of Alluxio by approximately 95%. The HDFS and Alluxio systems optimized with the above key technologies and methods have been deployed on the big data platform of Suning Commerce Co., Ltd., where they have significantly improved distributed data management capabilities and data access performance.
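
The reference-count and deferred-munmap idea in contribution (2) can be illustrated with a small sketch. This is not the thesis's implementation: Alluxio itself is written in Java, and the class name `RegionCache`, the `max_idle` parameter, and the `demo` helper below are all hypothetical names chosen for illustration. The sketch uses Python's standard `mmap` module, where closing a mapping stands in for the munmap system call, and it assumes a simple oldest-first eviction policy for the cache queue.

```python
import mmap
import os
import tempfile
from collections import OrderedDict


class RegionCache:
    """Hypothetical reference-counted cache of mapped file regions.

    A region whose reference count drops to zero is not unmapped right away.
    It is parked in an idle queue so that a later read of the same region can
    reuse the mapping instead of mapping (and page-faulting) it again.
    """

    def __init__(self, max_idle=2):
        self.max_idle = max_idle      # assumed bound on parked mappings
        self.active = {}              # key -> [mmap object, reference count]
        self.idle = OrderedDict()     # key -> mmap object, oldest first
        self.map_calls = 0
        self.unmap_calls = 0

    def acquire(self, path, offset, length):
        key = (path, offset, length)
        if key in self.active:                    # already in use: bump count
            self.active[key][1] += 1
            return self.active[key][0]
        if key in self.idle:                      # parked: revive, no new map
            region = self.idle.pop(key)
            self.active[key] = [region, 1]
            return region
        with open(path, "rb") as f:               # miss: create a new mapping
            region = mmap.mmap(f.fileno(), length,
                               access=mmap.ACCESS_READ, offset=offset)
        self.map_calls += 1
        self.active[key] = [region, 1]
        return region

    def release(self, path, offset, length):
        key = (path, offset, length)
        entry = self.active[key]
        entry[1] -= 1
        if entry[1] > 0:
            return
        del self.active[key]
        self.idle[key] = entry[0]                 # defer the unmap
        while len(self.idle) > self.max_idle:     # evict the oldest idle region
            _, victim = self.idle.popitem(last=False)
            victim.close()
            self.unmap_calls += 1

    def close(self):
        """Unmap everything that is still mapped."""
        for region, _count in self.active.values():
            region.close()
        for region in self.idle.values():
            region.close()
        self.active.clear()
        self.idle.clear()


def demo():
    """Read one region three times, then a second region, and report counts."""
    gran = mmap.ALLOCATIONGRANULARITY             # offsets must align to this
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"a" * gran + b"b" * gran)
        path = f.name
    cache = RegionCache(max_idle=1)
    try:
        for _ in range(3):                        # repeated reads, same region
            region = cache.acquire(path, 0, gran)
            first_byte = region[:1]
            cache.release(path, 0, gran)
        maps_after_repeats = cache.map_calls
        cache.acquire(path, gran, gran)           # a different region
        cache.release(path, gran, gran)           # pushes the first one out
        stats = {
            "first_byte": first_byte,
            "maps_after_repeats": maps_after_repeats,
            "map_calls": cache.map_calls,
            "unmaps_before_close": cache.unmap_calls,
        }
    finally:
        cache.close()
        os.remove(path)
    return stats
```

Calling `demo()` shows the intended effect under these assumptions: the three repeated reads of the same region share a single mapping, and an unmap only happens once the idle queue overflows.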
Keywords/Search Tags: management of multiple HDFS namespaces, distributed in-memory file system, performance optimization