
Study On Processing Of Massive Small Files Based On Hadoop

Posted on: 2012-01-12 | Degree: Master | Type: Thesis
Country: China | Candidate: D X Tai | Full Text: PDF
GTID: 2218330338953836 | Subject: Computer software and theory
Abstract/Summary:
In the modern era of high-speed data growth, traditional databases can no longer meet the needs of large-scale data processing. With the development of distributed file systems, Hadoop has gradually come into view. It is a large-cluster framework composed mainly of HDFS (the Hadoop Distributed File System) and MapReduce. At present, many large enterprises process huge amounts of data with Hadoop and have obtained good results. But people gradually discovered a serious flaw: Hadoop handles large files efficiently, but with massive small files it consumes the Namenode's resources heavily and access becomes slow. When massive small files are stored in HDFS, their metadata exhausts the Namenode's memory, and a large number of map tasks are spawned to process them. Therefore, the processing of massive small files is one of the important topics in Hadoop research.

To improve the efficiency of retrieving small files and to reduce the memory cost of processing massive small files on Hadoop, this paper focuses on the following work:

1) Designed a Small Files Management System (SFMS) specifically for handling massive small files. It comprises a Combiner, an Indexer, a Sorter and a Placer; its main idea is to store the index files on the datanodes.

2) Based on the observation that small files often need time-related batch removal, small files are merged according to time, so that batch removal achieves the best effect (see the merge sketch after this abstract). This paper proposes an index structure suited to small files and describes its design principle and construction process in detail. Considering the characteristics of HDFS, it also puts forward subdivision and placement strategies for the index files created from the small files, so as to improve retrieval efficiency and reduce communication cost among the machines (see the index sketch below).

3) Put forward a batch file refresh strategy, consisting of an add strategy and a delete strategy for small files, so that Hadoop supports updates to small files and updating efficiency is enhanced (see the refresh sketch below).

4) Designed experiments in Java to verify the approach. The tests demonstrate the performance of the proposed methods for storing and updating small files on Hadoop. The work effectively improves resource utilization and system response speed, reduces the burden on the Namenode, and to a certain extent alleviates the problem of updating files.
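As a concrete illustration of the merge-by-time idea in point 2), the sketch below packs local small files into a single HDFS file in modification-time order, so that files from the same time window land in the same merged container and can later be removed as a batch. This is a minimal sketch, not the thesis's implementation: the class name TimeOrderedMerger and the choice of SequenceFile as the container format are assumptions, since the abstract does not name a container format.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack local small files into one HDFS SequenceFile in
// modification-time order, so that one merged file covers one time
// window and can be removed as a batch. Names are illustrative.
public class TimeOrderedMerger {

    public static void merge(File[] smallFiles, Path target) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Sort by modification time: files created in the same period
        // end up adjacent in the same merged container file.
        Arrays.sort(smallFiles, Comparator.comparingLong(File::lastModified));

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, target, Text.class, BytesWritable.class);
        try {
            for (File f : smallFiles) {
                byte[] data = Files.readAllBytes(f.toPath());
                // Key: original file name; value: raw file contents.
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}
```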
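The abstract does not give the record layout of the index files in point 2). The following minimal sketch assumes one entry per small file holding the merged container's path plus the offset and length of the small file inside it, which is the least information needed for retrieval; all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a per-small-file index record. The fields (container path,
// offset, length) are an assumption covering the minimum needed to
// locate one small file inside a merged file.
public class SmallFileIndex {

    public static class Entry {
        public final String mergedFile; // HDFS path of the merged container
        public final long offset;       // byte offset of the small file in it
        public final long length;       // length of the small file in bytes

        public Entry(String mergedFile, long offset, long length) {
            this.mergedFile = mergedFile;
            this.offset = offset;
            this.length = length;
        }
    }

    private final Map<String, Entry> entries = new HashMap<String, Entry>();

    public void put(String smallFileName, Entry e) {
        entries.put(smallFileName, e);
    }

    public Entry lookup(String smallFileName) {
        return entries.get(smallFileName);
    }
}
```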
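Point 3)'s add and delete strategies are likewise only named in the abstract. One plausible reading, sketched below and building on the SmallFileIndex sketch above, is that since HDFS files are write-once, a delete is recorded as a tombstone in the index and an add registers the file against a freshly written merged container, with dead entries reclaimed when the container is rewritten. The tombstone mechanism is an assumption, not taken from the thesis.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a batch refresh strategy over the SmallFileIndex above.
// "Delete" is modelled as a tombstone; "add" records where the new
// small file was appended. Space held by tombstoned entries would be
// reclaimed when the merged container is rewritten.
public class BatchRefresh {

    private final SmallFileIndex index;
    private final Set<String> tombstones = new HashSet<String>();

    public BatchRefresh(SmallFileIndex index) {
        this.index = index;
    }

    // Delete strategy: mark the entry dead; the bytes remain in the
    // merged file until a later rewrite drops them.
    public void delete(String smallFileName) {
        tombstones.add(smallFileName);
    }

    // Add strategy: record where the new (or updated) small file was
    // appended, and clear any earlier tombstone for the same name.
    public void add(String smallFileName, String mergedFile, long offset, long length) {
        tombstones.remove(smallFileName);
        index.put(smallFileName, new SmallFileIndex.Entry(mergedFile, offset, length));
    }

    // A small file is readable only if it is indexed and not tombstoned.
    public boolean isLive(String smallFileName) {
        return !tombstones.contains(smallFileName) && index.lookup(smallFileName) != null;
    }
}
```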
Keywords/Search Tags: Hadoop, Massive Small Files, SFMS, Index Structure, Batch Remove