As a mature distributed system, Hadoop provides powerful storage and analysis capabilities for massive data and performs well in many applications. HDFS (Hadoop Distributed File System), the subject of this paper, is an essential component of Hadoop. It is designed to store large files with streaming data access patterns, which suits the analysis of large datasets. However, many scenarios also require handling large numbers of small files, and HDFS does not cope with them well. Its flaws become apparent as the number of small files grows significantly: the namenode's memory is consumed rapidly and becomes the bottleneck of the whole system, the efficiency of accessing small files drops sharply, and MapReduce resources are wasted. Analysis shows that the key to these problems is to reduce both the total number of files and the number of communications between clients and the namenode during file access. Following this approach, small files are merged into large ones, which are stored in the system as a whole, so that the file metadata maintained by the namenode shrinks and the memory pressure is alleviated. Then, exploiting the fast lookup and ordered storage of the B+ tree, a B+ tree based index is built to record the mapping between small files and merged files. The B+ tree is also modified so that, when a client accesses a file, it obtains the index information of files related by path and by upload time; this reduces how often the index must be read, since there is no need to fetch the index from the namenode when it is already available locally. Furthermore, the data block containing the requested file is prefetched, and an index is built to maintain the relevant information, saving communications when accessing files in the same block. Finally, a Hadoop cluster is built, the proposed method is deployed on it, and memory usage and access efficiency are measured. The experiments show that, compared with the original HDFS, the proposed method noticeably improves memory usage and file access efficiency.
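To make the merge-and-index idea concrete, the following is a minimal Java sketch, not the paper's actual implementation: it appends small files into one merged HDFS file and records each file's (offset, length) in a sorted in-memory map, where Java's TreeMap merely stands in for the paper's B+ tree index. The class SmallFileMerger and its methods are hypothetical names introduced for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.IOException;
import java.util.List;
import java.util.TreeMap;

public class SmallFileMerger {

    /** Index entry: where a small file lives inside the merged file. */
    static class Extent {
        final long offset;
        final long length;
        Extent(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Sorted map keyed by the small file's original path. The paper uses a B+ tree;
    // a TreeMap is used here only to keep the sketch self-contained.
    private final TreeMap<String, Extent> index = new TreeMap<>();
    private final FileSystem fs;

    SmallFileMerger(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
    }

    /** Append each small file to one merged HDFS file and record its extent. */
    void merge(List<Path> smallFiles, Path mergedFile) throws IOException {
        long offset = 0;
        try (FSDataOutputStream out = fs.create(mergedFile)) {
            for (Path p : smallFiles) {
                long len = fs.getFileStatus(p).getLen();
                try (FSDataInputStream in = fs.open(p)) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
                index.put(p.toString(), new Extent(offset, len));
                offset += len;
            }
        }
    }

    /** Read one small file back by seeking into the merged file. */
    byte[] read(String originalPath, Path mergedFile) throws IOException {
        Extent e = index.get(originalPath);
        byte[] buf = new byte[(int) e.length];
        try (FSDataInputStream in = fs.open(mergedFile)) {
            in.seek(e.offset);
            in.readFully(buf);
        }
        return buf;
    }
}
```

In this sketch the namenode only ever sees the merged file, which is the source of the memory savings; the per-small-file bookkeeping lives in the client-side index, and the paper's path- and upload-time-aware B+ tree plus block prefetching then reduce how often that index and the namenode must be consulted.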