Research And Implementation Of Small Files Storage Management Based On Hadoop

Posted on: 2016-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: D P Zuo
GTID: 2298330467979198
Subject: Computer Science and Technology
Abstract/Summary:
Data is growing explosively in the era of big data, and traditional technical architectures can no longer meet the need to process such volumes. Hadoop, developed by the Apache Foundation, has been adopted quickly across many fields and has become the first choice of many enterprises. As a platform for analyzing massive amounts of data, Hadoop offers high fault tolerance, easy scalability, and inexpensive storage, but its file system, HDFS, was designed to store large files. With the rapid development of social networking and mobile Internet technology, the Internet now produces a huge number of small files, and HDFS is increasingly used to store them in both research and applications. Because HDFS uses a master-slave architecture, the metadata of a huge number of small files puts heavy memory pressure on the master node (the NameNode), lowers the read and write efficiency of the system, and creates a performance bottleneck. Existing strategies based on merging small files, together with detailed quantitative analysis, solve part of this problem. However, these schemes have flaws in the design of their index mechanisms and do not consider the correlation between files, which leads to slow reads of small files, an overloaded NameNode, and limited practicality. Building on the sound strategies of other researchers and on a quantitative analysis of memory consumption and access performance, this thesis proposes an optimization scheme that combines a minute-period merging algorithm with a multi-level index. The scheme targets the excessive NameNode memory consumption and the low file retrieval efficiency that arise when Hadoop handles massive numbers of small files.
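The memory pressure described above can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not the thesis's quantitative model: it assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (one object per file and one per block), ignores directories, and uses a hypothetical helper name, `namenode_bytes`.

```python
# Assumption: roughly 150 bytes of NameNode heap per namespace object.
# This is a rule of thumb, not a figure taken from the thesis.
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate heap used by file and block objects (directories ignored)."""
    objects_per_file = 1 + blocks_per_file  # one file object plus its blocks
    return num_files * objects_per_file * BYTES_PER_OBJECT


# Ten million small files, each occupying its own block.
small = namenode_bytes(10_000_000)
# The same data packed into 10,000 merged files (assumed one block each).
merged = namenode_bytes(10_000)

print(f"unmerged: {small / 1e9:.1f} GB, merged: {merged / 1e6:.1f} MB")
```

Under these assumptions, merging shrinks the NameNode's share of the metadata by the merge ratio; the per-file lookup information then has to live in a separate index.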
The main idea of the method is to merge and store small files created within the same minute, according to each file's creation time. Based on the mapping from a small file's creation time to the name of its merged file, the scheme builds a global index from small files to block and DataNode information. In addition, using the small file's name and extension, the thesis builds a Trie index that maps each small file to a specific block and its block address information, and shards that index by extension to form a local double index. This local double index is kept in DataNode memory to speed up small-file retrieval in HDFS. The thesis gives a concrete implementation of the scheme on a Hadoop cluster, including the small-file merging algorithm, a custom input split for MapReduce, the construction of the global index and the local double index, and the solution to technical problems in configuring the master and slave nodes. It also presents a quantitative analysis of the scheme for handling small files. The proposed scheme is compared with the HAR archive technology in tests covering six points in three parts. The results show that the proposed scheme is as effective as HAR at addressing the excessive memory consumption caused by small files. Moreover, when retrieving small files, the multi-level index mechanism reduces NameNode memory consumption more effectively than the double index of HAR, and it also improves small-file retrieval efficiency on the Hadoop platform.
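The merging and indexing idea can be sketched in a few lines of Python. This is an in-memory illustration only, not the thesis's Hadoop implementation: the names `merged_name`, `merge_by_minute`, `Trie`, `build_local_index`, and `lookup` are hypothetical, the merged-file naming pattern is an assumption, and a byte offset and length stand in for the block address information that the real index would hold.

```python
import os
from collections import defaultdict
from datetime import datetime


def merged_name(created: datetime) -> str:
    """Map a creation time to its minute bucket (naming pattern is assumed)."""
    return created.strftime("merged_%Y%m%d%H%M")


def merge_by_minute(files):
    """Pack (name, created, data) tuples into one buffer per minute.

    Returns (merged, locations): merged maps merged-file name to bytes,
    locations maps small-file name to (merged-file name, offset, length).
    """
    buffers = defaultdict(bytearray)
    locations = {}
    for name, created, data in files:
        key = merged_name(created)
        locations[name] = (key, len(buffers[key]), len(data))
        buffers[key] += data
    return {k: bytes(v) for k, v in buffers.items()}, locations


class Trie:
    """Character trie; the empty-string key marks the end of a stored name."""

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("")


def build_local_index(locations):
    """Shard by extension first, then index file stems in a per-shard trie."""
    shards = defaultdict(Trie)
    for name, loc in locations.items():
        stem, ext = os.path.splitext(name)
        shards[ext].insert(stem, loc)
    return shards


def lookup(shards, name):
    stem, ext = os.path.splitext(name)
    trie = shards.get(ext)
    return trie.get(stem) if trie is not None else None


files = [
    ("a.txt", datetime(2016, 10, 8, 9, 30, 5), b"hello"),
    ("b.log", datetime(2016, 10, 8, 9, 30, 40), b"world!"),
    ("c.txt", datetime(2016, 10, 8, 9, 31, 2), b"xyz"),
]
merged, locations = merge_by_minute(files)
shards = build_local_index(locations)
key, offset, length = lookup(shards, "b.log")
content = merged[key][offset:offset + length]
```

Here the two files created at 09:30 land in one merged buffer and the 09:31 file in another. A lookup first picks the extension shard and then walks the trie by file stem, which mirrors the two-step local index described above.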
Keywords/Search Tags:Hadoop, HDFS, Small Files, Storage, Merge, Retrieve, Index