
Study On Processing Of Massive Small Files Based On Hadoop

Posted on: 2012-01-12 | Degree: Master | Type: Thesis
Country: China | Candidate: D X Tai | Full Text: PDF
GTID: 2218330338953836 | Subject: Computer software and theory
Abstract/Summary:
In the modern era of high-speed data growth, traditional databases can no longer meet the needs of large-scale data processing. With the development of distributed file systems, Hadoop has gradually come into view. It is a large-cluster framework composed mainly of HDFS (the Hadoop Distributed File System) and MapReduce. At present, many large enterprises process huge amounts of data with Hadoop and have obtained good results. But people gradually discovered a serious flaw: Hadoop handles large files efficiently, but with massive small files it consumes the Namenode's resources heavily and access becomes slow. When massive small files are stored in HDFS, their metadata exhausts the Namenode's memory, and a large number of map tasks are spawned to process them. Therefore, the processing of massive small files is one of the important topics in Hadoop research.

To improve the efficiency of retrieving small files and to reduce the memory cost of processing massive small files on Hadoop, this paper focuses on the following work:

1) Designed a Small Files Management System (SFMS) specifically for handling massive small files. It comprises a Combiner, an Indexer, a Sorter and a Placer; its main idea is to store the index files on the datanodes.

2) Based on the observation that small files often need time-related batch removal, small files are merged according to time, so that batch removal achieves the best effect (see the merge sketch after this abstract). This paper proposes an index structure suited to small files and describes its design principle and construction process in detail. Considering the characteristics of HDFS, it also puts forward subdivision and placement strategies for the index files created from the small files, so as to improve retrieval efficiency and reduce communication cost among the machines (see the index sketch below).

3) Put forward a batch file refresh strategy, consisting of an add strategy and a delete strategy for small files, so that Hadoop supports updates to small files and updating efficiency is enhanced (see the refresh sketch below).

4) Designed experiments in Java to verify the approach. The tests demonstrate the performance of the proposed methods for storing and updating small files on Hadoop. The work effectively improves resource utilization and system response speed, reduces the burden on the Namenode, and to a certain extent alleviates the problem of updating files.
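As a concrete illustration of the merge-by-time idea in point 2), the sketch below packs local small files into a single HDFS file in modification-time order, so that files from the same time window land in the same merged container and can later be removed as a batch. This is a minimal sketch, not the thesis's implementation: the class name TimeOrderedMerger and the choice of SequenceFile as the container format are assumptions, since the abstract does not name a container format.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack local small files into one HDFS SequenceFile in
// modification-time order, so that one merged file covers one time
// window and can be removed as a batch. Names are illustrative.
public class TimeOrderedMerger {

    public static void merge(File[] smallFiles, Path target) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Sort by modification time: files created in the same period
        // end up adjacent in the same merged container file.
        Arrays.sort(smallFiles, Comparator.comparingLong(File::lastModified));

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, target, Text.class, BytesWritable.class);
        try {
            for (File f : smallFiles) {
                byte[] data = Files.readAllBytes(f.toPath());
                // Key: original file name; value: raw file contents.
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        } finally {
            writer.close();
        }
    }
}
```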
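The abstract does not give the record layout of the index files in point 2). The following minimal sketch assumes one entry per small file holding the merged container's path plus the offset and length of the small file inside it, which is the least information needed for retrieval; all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a per-small-file index record. The fields (container path,
// offset, length) are an assumption covering the minimum needed to
// locate one small file inside a merged file.
public class SmallFileIndex {

    public static class Entry {
        public final String mergedFile; // HDFS path of the merged container
        public final long offset;       // byte offset of the small file in it
        public final long length;       // length of the small file in bytes

        public Entry(String mergedFile, long offset, long length) {
            this.mergedFile = mergedFile;
            this.offset = offset;
            this.length = length;
        }
    }

    private final Map<String, Entry> entries = new HashMap<String, Entry>();

    public void put(String smallFileName, Entry e) {
        entries.put(smallFileName, e);
    }

    public Entry lookup(String smallFileName) {
        return entries.get(smallFileName);
    }
}
```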
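Point 3)'s add and delete strategies are likewise only named in the abstract. One plausible reading, sketched below and building on the SmallFileIndex sketch above, is that since HDFS files are write-once, a delete is recorded as a tombstone in the index and an add registers the file against a freshly written merged container, with dead entries reclaimed when the container is rewritten. The tombstone mechanism is an assumption, not taken from the thesis.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a batch refresh strategy over the SmallFileIndex above.
// "Delete" is modelled as a tombstone; "add" records where the new
// small file was appended. Space held by tombstoned entries would be
// reclaimed when the merged container is rewritten.
public class BatchRefresh {

    private final SmallFileIndex index;
    private final Set<String> tombstones = new HashSet<String>();

    public BatchRefresh(SmallFileIndex index) {
        this.index = index;
    }

    // Delete strategy: mark the entry dead; the bytes remain in the
    // merged file until a later rewrite drops them.
    public void delete(String smallFileName) {
        tombstones.add(smallFileName);
    }

    // Add strategy: record where the new (or updated) small file was
    // appended, and clear any earlier tombstone for the same name.
    public void add(String smallFileName, String mergedFile, long offset, long length) {
        tombstones.remove(smallFileName);
        index.put(smallFileName, new SmallFileIndex.Entry(mergedFile, offset, length));
    }

    // A small file is readable only if it is indexed and not tombstoned.
    public boolean isLive(String smallFileName) {
        return !tombstones.contains(smallFileName) && index.lookup(smallFileName) != null;
    }
}
```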
Keywords/Search Tags: Hadoop, Massive Small Files, SFMS, Index Structure, Batch Remove