
Optimization Scheme Of Small File Processing Based On HDFS

Posted on: 2019-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Tong
Full Text: PDF
GTID: 2428330548474967
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, data is growing exponentially. Such massive data is both an opportunity and a challenge for society, science, and technology: it strains traditional technologies, and its storage and processing have drawn wide attention from all walks of life. A large proportion of this data consists of small files, which appear in people's lives in many forms. How to mine valuable information from large numbers of small files and use it to solve real problems is an urgent question.

As an open-source cloud computing platform, Hadoop has attracted wide attention from experts and scholars at home and abroad since its release, and major Internet companies have adopted it in their development. HDFS (Hadoop Distributed File System), Hadoop's distributed file system, features high reliability, high concurrency, high availability, and high fault tolerance, and is very effective for storing and processing big data. However, its master-slave architecture performs poorly when storing and processing large numbers of small files: the metadata of massive small files becomes a bottleneck that constrains the NameNode and seriously degrades the efficiency of reading small files. To solve this problem, this thesis proposes a dynamic queue scheme that reduces the metadata held by the NameNode and uses a prefetching cache strategy to improve small-file read efficiency. The main work of this thesis is as follows:

(1) Analyze the problems HDFS has in storing and processing large numbers of small files, together with the advantages and disadvantages of existing research schemes, and identify and explore the root cause of the small-file problem through an analysis of the HDFS architecture and working principles.

(2) Propose a dynamic queue scheme for the poor performance of HDFS when storing and processing large volumes of small files. First, using the analytic hierarchy process, NameNode memory consumption, file download speed, and file merge speed are taken as three evaluation indicators, and the weight of each indicator in the overall performance of small-file storage and processing is analyzed. Second, an improved logarithm-based normalization method standardizes the experimental data of the three indicators into dimensionless values, turning qualitative judgments into quantitative ones, and the trend of system performance is computed from these values. Based on that trend, different size ranges of small files are determined, and the best queue size for each range is calculated from the three indicators; a symbol-based text similarity detection method is used to detect similarity among small text files. Finally, for each range of small files the best queue is selected for merged storage, which reduces the NameNode's metadata memory consumption, and a secondary directory together with a prefetching cache strategy is designed to improve small-file read efficiency (illustrative formula and code sketches follow this summary). Experiments compare the dynamic queue scheme with using HDFS directly and with a single queue for processing small files, and show that the dynamic queue scheme effectively reduces the amount of metadata and improves the efficiency of reading small files.
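The abstract does not state the exact normalization or weighting formulas. A minimal reconstruction, assuming a standard log-based normalization and an AHP-style weighted sum (the symbols x_i, x_max, w_i, and S are illustrative, not the thesis's notation):

    x'_i = \frac{\ln(1 + x_i)}{\ln(1 + x_{\max})}, \qquad
    S = \sum_{i=1}^{3} w_i \, x'_i, \qquad \sum_{i=1}^{3} w_i = 1

Here x_i is the measured value of indicator i (NameNode memory consumption, file download speed, file merge speed), x_max its observed maximum, and w_i the AHP-derived weight. A cost indicator such as memory consumption would enter as 1 - x'_i, so that a larger S always means better performance.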
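To make the merge-and-index step concrete: below is a minimal sketch, using the standard Hadoop FileSystem API, that concatenates the small files in one directory into a single HDFS file and records each file's offset and length as the per-file index record (the class SmallFileMerger and all names in it are hypothetical, not the thesis's implementation):

    // Hedged sketch: merge the small files under inputDir into one HDFS file
    // and build an in-memory index of name -> (offset, length). The index
    // plays the role of the scheme's per-file "secondary directory" entry.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallFileMerger {
        /** Offset/length record for one small file inside the merged file. */
        public static class Extent {
            public final long offset, length;
            Extent(long offset, long length) { this.offset = offset; this.length = length; }
        }

        public static Map<String, Extent> merge(FileSystem fs, Path inputDir, Path mergedFile)
                throws IOException {
            Map<String, Extent> index = new HashMap<>();
            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (FileStatus st : fs.listStatus(inputDir)) {
                    if (st.isDirectory()) continue;              // merge plain files only
                    long offset = out.getPos();                  // start of this file's bytes
                    try (FSDataInputStream in = fs.open(st.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false); // append the file body
                    }
                    index.put(st.getPath().getName(), new Extent(offset, st.getLen()));
                }
            }
            return index; // the NameNode now tracks one merged file, not one entry per small file
        }
    }

Reading a small file back is then a single fs.open(mergedFile) followed by a seek to index entry's offset, which is why shrinking the NameNode's metadata does not sacrifice random access.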
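Likewise, the prefetching cache strategy is only named, not specified. One plausible building block is an LRU cache of small-file contents, where a miss on one file could also load its neighbors from the same merged file so that subsequent reads hit in memory (PrefetchCache is a hypothetical name, not the thesis's class):

    // Hedged sketch of a cache for the prefetching strategy: an LRU map from
    // small-file name to its bytes, evicting the least-recently-used entry.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PrefetchCache extends LinkedHashMap<String, byte[]> {
        private final int maxEntries;

        public PrefetchCache(int maxEntries) {
            super(16, 0.75f, true);      // accessOrder = true gives LRU behavior
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxEntries;  // evict once the cache exceeds its capacity
        }
    }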
Keywords/Search Tags: Hadoop, HDFS, small files, dynamic queue