
Optimization Of Massive Small Files On Hadoop Cluster

Posted on: 2015-01-18
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
Full Text: PDF
GTID: 2298330452450758
Subject: Computer application technology

Abstract/Summary:
With the development of the Mobile Internet and the Internet of Things, the amount of data on the Internet is growing exponentially, and traditional technical architectures have become inadequate for processing it. Hadoop, a framework that can process massive data efficiently, has received more and more attention from industry. Hadoop follows a master-slave architecture and consists of the HDFS file system and the MapReduce computing framework. The single-NameNode design of HDFS simplifies the management of the file system, but it also makes the processing of small files inefficient. Based on a study of how industry and academia process massive numbers of small files, and of the technical details of Hadoop and its ecosystem, this thesis identifies two problems with current solutions: they do not take the diversity and repeatability of file types into consideration, and they do not thoroughly solve the single-point problem of the Hadoop cluster. This thesis therefore puts forward a plan to optimize the Hadoop cluster using related Hadoop components, improving its performance on massive small files.

First, the MD5 algorithm is used to determine whether two files are duplicates by comparing digests of their content (see the first sketch after this abstract). Duplicate files are not written again, which reduces the number of written files and the disk consumption.

Second, MapFile is used to merge small files, which are stored differently according to their size. A small file is placed into a multi-level merge queue according to its file type; when the queue threshold is reached, the small files in the queue are merged and written into HDFS (sketched below). This reduces the number of files to a certain degree.

Third, HBase is used to persist the index information (sketched below). This not only keeps data reading and writing efficient, but also provides a stable external service, because the index is also held in a cache and consistency is maintained between the cache and the indexer.

Fourth, this thesis presents a "mark-delete-compress" method for deleting files (sketched below). When a deletion request is received, the flag of the small file is modified in the cache; when the small file is deleted, the cluster compresses the large file in which it is located. On the one hand, this improves the deletion rate; on the other hand, it reduces the space fragmentation generated by deleting small files.

Finally, this thesis presents a simple upload and download system, covering the design and implementation of the file upload and download modules before and after optimization. The read and write efficiency of the system is tested, and the memory, disk, and network consumption of the master node before and after optimization is compared and analyzed. The final result is that the optimization scheme performs better than traditional Hadoop.
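As a rough illustration of the deduplication step, the following Java sketch computes an MD5 digest of a file's content and checks it against an index of previously seen digests. The in-memory HashMap stands in for the persistent index (the thesis keeps index information in HBase); the class and method names are illustrative, not taken from the thesis.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of content-based deduplication with MD5.
// A HashMap stands in for the persistent digest index here.
public class Md5Deduplicator {
    // digest (hex) -> logical path of the first file stored with that content
    private final Map<String, String> digestIndex = new HashMap<>();

    // Returns true if the file's content is new and should be written.
    public boolean shouldWrite(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        String hex = toHex(md.digest());
        // putIfAbsent returns null only when the digest was not seen before
        return digestIndex.putIfAbsent(hex, file.toString()) == null;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}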
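The merging step might look like the sketch below: small files are buffered per file type in a sorted queue and flushed into a single HDFS MapFile once a threshold is reached. The threshold value, output path layout, and class names are assumptions for illustration; a TreeMap is used because MapFile requires keys to be appended in sorted order.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch of a per-type merge queue: small files are buffered by type
// and flushed into one HDFS MapFile when a threshold is reached.
public class SmallFileMerger {
    private static final int QUEUE_THRESHOLD = 1000; // illustrative value

    private final Configuration conf = new Configuration();
    // TreeMap keeps file names sorted, as MapFile requires ordered keys
    private final TreeMap<String, byte[]> queue = new TreeMap<>();
    private final String fileType;
    private int generation = 0;

    public SmallFileMerger(String fileType) {
        this.fileType = fileType;
    }

    public synchronized void add(String name, byte[] content) throws IOException {
        queue.put(name, content);
        if (queue.size() >= QUEUE_THRESHOLD) {
            flush();
        }
    }

    // Writes the queued small files into a single MapFile on HDFS.
    private void flush() throws IOException {
        Path out = new Path("/merged/" + fileType + "/part-" + (generation++));
        try (MapFile.Writer writer = new MapFile.Writer(conf, out,
                MapFile.Writer.keyClass(Text.class),
                MapFile.Writer.valueClass(BytesWritable.class))) {
            for (Map.Entry<String, byte[]> e : queue.entrySet()) {
                writer.append(new Text(e.getKey()), new BytesWritable(e.getValue()));
            }
        }
        queue.clear();
    }
}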
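For the index persistence, a minimal sketch using the standard HBase client API follows. The table name, column family, and qualifier ("sf_index", "i", "container") are hypothetical; the thesis's actual schema, and its cache layer in front of HBase, are not shown here.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of persisting small-file index entries in HBase.
public class HBaseIndexStore implements AutoCloseable {
    private static final TableName TABLE = TableName.valueOf("sf_index");
    private static final byte[] CF = Bytes.toBytes("i");
    private static final byte[] COL = Bytes.toBytes("container");

    private final Connection connection;

    public HBaseIndexStore(Configuration conf) throws IOException {
        this.connection = ConnectionFactory.createConnection(HBaseConfiguration.create(conf));
    }

    // Maps a small-file name to the container MapFile that holds it.
    public void putEntry(String smallFile, String containerPath) throws IOException {
        try (Table table = connection.getTable(TABLE)) {
            Put put = new Put(Bytes.toBytes(smallFile));
            put.addColumn(CF, COL, Bytes.toBytes(containerPath));
            table.put(put);
        }
    }

    // Looks up the container MapFile for a small file, or null if absent.
    public String getContainer(String smallFile) throws IOException {
        try (Table table = connection.getTable(TABLE)) {
            Result r = table.get(new Get(Bytes.toBytes(smallFile)));
            byte[] v = r.getValue(CF, COL);
            return v == null ? null : Bytes.toString(v);
        }
    }

    @Override
    public void close() throws IOException {
        connection.close();
    }
}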
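Finally, the "mark-delete-compress" idea could be sketched as follows: a deletion only records a mark, and a later compaction rewrites the container MapFile without the marked entries. The in-memory mark set stands in for the flags the thesis keeps in its cache, and the rename-based swap is a simplification of how a compacted container would replace the original.

import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Sketch of mark-delete-compress: deletions only set a flag, and a
// later compaction rewrites the container MapFile without them.
public class MarkDeleteCompactor {
    private final Configuration conf = new Configuration();
    // in-memory deletion marks; the thesis keeps these flags in a cache
    private final Set<String> deleted = ConcurrentHashMap.newKeySet();

    // "Delete" is just a mark; the data stays on disk for now.
    public void markDeleted(String smallFileName) {
        deleted.add(smallFileName);
    }

    // Rewrites the container MapFile, dropping the marked entries.
    public void compact(Path container) throws IOException {
        Path tmp = new Path(container.toString() + ".compact");
        Text key = new Text();
        BytesWritable val = new BytesWritable();
        try (MapFile.Reader reader = new MapFile.Reader(container, conf);
             MapFile.Writer writer = new MapFile.Writer(conf, tmp,
                     MapFile.Writer.keyClass(Text.class),
                     MapFile.Writer.valueClass(BytesWritable.class))) {
            while (reader.next(key, val)) {
                if (!deleted.contains(key.toString())) {
                    writer.append(key, val); // keep live entries only
                }
            }
        }
        // swap the compacted container into place
        FileSystem fs = FileSystem.get(conf);
        fs.delete(container, true);
        fs.rename(tmp, container);
    }
}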
Keywords/Search Tags: Hadoop, small file, HDFS, MD5, MapFile