
Optimization Scheme Of Small File Processing Based On HDFS

Posted on: 2019-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: D Y Tong
Full Text: PDF
GTID: 2428330548474967
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology, data is growing exponentially. Such massive data is both an opportunity and a challenge for society, science, and technology: it strains traditional technologies, and its storage and processing have drawn wide attention from all walks of life. A large proportion of this data consists of small files, which appear in people's lives in many forms. How to mine valuable information from large numbers of small files and use it to solve real problems is an urgent question.

As an open-source cloud computing platform, Hadoop has attracted wide attention from experts and scholars at home and abroad since its release, and major Internet companies have adopted it in their development. HDFS (Hadoop Distributed File System), Hadoop's distributed file system, features high reliability, high concurrency, high availability, and high fault tolerance, and is very effective for storing and processing big data. However, its master-slave architecture performs poorly when storing and processing large numbers of small files: the metadata of massive small files becomes a bottleneck that constrains the NameNode and seriously degrades the efficiency of reading small files. To solve this problem, this thesis proposes a dynamic queue scheme that reduces the metadata held by the NameNode and uses a prefetching cache strategy to improve small-file read efficiency. The main work of this thesis is as follows:

(1) Analyze the problems HDFS has in storing and processing large numbers of small files, together with the advantages and disadvantages of existing research schemes, and identify and explore the root cause of the small-file problem through an analysis of the HDFS architecture and working principles.

(2) Propose a dynamic queue scheme for the poor performance of HDFS when storing and processing large volumes of small files. First, using the analytic hierarchy process, NameNode memory consumption, file download speed, and file merge speed are taken as three evaluation indicators, and the weight of each indicator in the overall performance of small-file storage and processing is analyzed. Second, an improved logarithm-based normalization method standardizes the experimental data of the three indicators into dimensionless values, turning qualitative judgments into quantitative ones, and the trend of system performance is computed from these values. Based on that trend, different size ranges of small files are determined, and the best queue size for each range is calculated from the three indicators; a symbol-based text similarity detection method is used to detect similarity among small text files. Finally, for each range of small files the best queue is selected for merged storage, which reduces the NameNode's metadata memory consumption, and a secondary directory together with a prefetching cache strategy is designed to improve small-file read efficiency (illustrative formula and code sketches follow this summary). Experiments compare the dynamic queue scheme with using HDFS directly and with a single queue for processing small files, and show that the dynamic queue scheme effectively reduces the amount of metadata and improves the efficiency of reading small files.
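The abstract does not state the exact normalization or weighting formulas. A minimal reconstruction, assuming a standard log-based normalization and an AHP-style weighted sum (the symbols x_i, x_max, w_i, and S are illustrative, not the thesis's notation):

    x'_i = \frac{\ln(1 + x_i)}{\ln(1 + x_{\max})}, \qquad
    S = \sum_{i=1}^{3} w_i \, x'_i, \qquad \sum_{i=1}^{3} w_i = 1

Here x_i is the measured value of indicator i (NameNode memory consumption, file download speed, file merge speed), x_max its observed maximum, and w_i the AHP-derived weight. A cost indicator such as memory consumption would enter as 1 - x'_i, so that a larger S always means better performance.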
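To make the merge-and-index step concrete: below is a minimal sketch, using the standard Hadoop FileSystem API, that concatenates the small files in one directory into a single HDFS file and records each file's offset and length as the per-file index record (the class SmallFileMerger and all names in it are hypothetical, not the thesis's implementation):

    // Hedged sketch: merge the small files under inputDir into one HDFS file
    // and build an in-memory index of name -> (offset, length). The index
    // plays the role of the scheme's per-file "secondary directory" entry.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallFileMerger {
        /** Offset/length record for one small file inside the merged file. */
        public static class Extent {
            public final long offset, length;
            Extent(long offset, long length) { this.offset = offset; this.length = length; }
        }

        public static Map<String, Extent> merge(FileSystem fs, Path inputDir, Path mergedFile)
                throws IOException {
            Map<String, Extent> index = new HashMap<>();
            try (FSDataOutputStream out = fs.create(mergedFile)) {
                for (FileStatus st : fs.listStatus(inputDir)) {
                    if (st.isDirectory()) continue;              // merge plain files only
                    long offset = out.getPos();                  // start of this file's bytes
                    try (FSDataInputStream in = fs.open(st.getPath())) {
                        IOUtils.copyBytes(in, out, 4096, false); // append the file body
                    }
                    index.put(st.getPath().getName(), new Extent(offset, st.getLen()));
                }
            }
            return index; // the NameNode now tracks one merged file, not one entry per small file
        }
    }

Reading a small file back is then a single fs.open(mergedFile) followed by a seek to index entry's offset, which is why shrinking the NameNode's metadata does not sacrifice random access.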
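Likewise, the prefetching cache strategy is only named, not specified. One plausible building block is an LRU cache of small-file contents, where a miss on one file could also load its neighbors from the same merged file so that subsequent reads hit in memory (PrefetchCache is a hypothetical name, not the thesis's class):

    // Hedged sketch of a cache for the prefetching strategy: an LRU map from
    // small-file name to its bytes, evicting the least-recently-used entry.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PrefetchCache extends LinkedHashMap<String, byte[]> {
        private final int maxEntries;

        public PrefetchCache(int maxEntries) {
            super(16, 0.75f, true);      // accessOrder = true gives LRU behavior
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > maxEntries;  // evict once the cache exceeds its capacity
        }
    }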
Keywords/Search Tags: Hadoop, HDFS, small files, dynamic queue