With the arrival of the big data era, the volume of data produced in China alone each year has reached the zettabyte scale, creating growing storage and processing challenges for traditional database technologies and systems. Hadoop, an open-source distributed data processing framework developed by the Apache Software Foundation in recent years, offers reliability, scalability, high efficiency, and low cost when processing big data. HDFS, the Hadoop file system, inherits the strengths of traditional storage systems and adds design ideas aimed at massive data, making it well suited to storing and transmitting very large files. However, when Hadoop processes a mass of small files, performance bottlenecks emerge, such as high NameNode memory occupancy and low access and retrieval efficiency, which has made small-file storage an important research topic in both academia and industry.

To address these problems, this paper reduces the number of small files by classifying and merging them after measuring their relevance with the BM25 algorithm, thereby alleviating the NameNode memory bottleneck caused by storing small-file metadata. An index mechanism is established over the merged files, enabling access both to a merged file as a whole and to any single file within it. In addition, prefetching mechanisms for correlated small files and for their index entries are provided according to file correlation, which reduces the metadata-request load on the NameNode and speeds up the handling of subsequent requests.
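The relevance measure above can be illustrated with a minimal sketch of Okapi BM25 scoring. The function name, tokenized inputs, and default constants (`k1`, `b`) are illustrative assumptions, not the paper's implementation; in the proposed scheme, files whose pairwise scores exceed a threshold would be grouped into the same merged block.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    query_terms / doc_terms: lists of tokens; corpus: list of token lists.
    k1 and b are the standard BM25 tuning constants.
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n   # average document length
    tf = Counter(doc_terms)                   # term frequencies in this document
    score = 0.0
    for term in query_terms:
        # document frequency: how many documents in the corpus contain the term
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

Treating one small file's tokens as the query and another's as the document gives a symmetric enough correlation signal for grouping; a file sharing no terms with the query scores zero and stays in a different group.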
Throughout the system, a cache is built with the memory database technology Memcached to hold the file-mapping index and to prefetch correlated files, improving the retrieval efficiency of small files. A systematic test of the improved small-file processing scheme is presented: with network data as the experimental data, three test cases are designed, and their write and read times, RAM consumption, and storage performance are analyzed. The tests show that the solution proposed in this paper effectively improves the storage and reading efficiency of small files and, to a certain extent, improves the storage performance of Hadoop.
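The cache-and-prefetch idea can be sketched as follows. This is a hedged illustration only: a plain dict stands in for the Memcached client, and the class name, index layout `(block, offset, length)`, and `related` map are assumptions rather than the paper's actual data structures. On a cache miss, the entry is fetched from the persisted index and the index entries of correlated files are prefetched alongside it.

```python
class IndexCache:
    """Map a small-file name to its (merged_block, offset, length) entry,
    prefetching the index entries of correlated files on a miss.

    A plain dict stands in for Memcached here; a real deployment would
    issue get/set calls against a Memcached server instead.
    """

    def __init__(self, index_store, related):
        self.store = index_store   # persisted index: name -> (block, offset, length)
        self.related = related     # correlation map: name -> list of related names
        self.cache = {}            # stand-in for the Memcached layer

    def lookup(self, name):
        entry = self.cache.get(name)
        if entry is None:
            # miss: load from the persisted index, then warm the cache
            entry = self.store[name]
            self.cache[name] = entry
            # prefetch entries for files merged alongside this one,
            # so correlated requests hit the cache directly
            for other in self.related.get(name, []):
                self.cache.setdefault(other, self.store[other])
        return entry
```

Because correlated files were merged into the same block, a single prefetch pass turns the likely next requests into cache hits, avoiding repeated metadata round trips to the NameNode.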