
Research And Application Of Massive Small Files Processing Techniques Based On Hadoop

Posted on: 2016-08-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Yao
Full Text: PDF
GTID: 2308330473965473
Subject: Computer technology

Abstract/Summary:
With the arrival of the big data era, the volume of data produced in China alone each year has reached the zettabyte level, making storage and processing increasingly difficult for traditional database technologies and systems. Hadoop, an open-source distributed data processing framework developed in recent years by the Apache Software Foundation, offers reliability, scalability, efficiency, and low cost when processing big data. HDFS, Hadoop's file system, inherits the strengths of traditional storage systems and adds pioneering design ideas for handling massive data, making it well suited to storing and transmitting large files. However, when Hadoop processes large numbers of small files, a series of performance bottlenecks emerges, such as high memory occupancy on the NameNode and low access and retrieval efficiency; this has become an important research topic in both academia and industry.

To address these problems, this thesis reduces the number of small files by classifying and merging them according to their relevance, as determined by the BM25 algorithm, thereby alleviating the NameNode memory bottleneck caused by storing small-file metadata. An index mechanism is established alongside file merging, enabling access both to a merged file as a whole and to any single file within it. In addition, prefetching mechanisms for correlated small files and their index entries are provided based on file correlation, which reduces the metadata-request load on the NameNode and serves such requests more quickly.
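The thesis does not reproduce its scoring code here, but the BM25 relevance measure it relies on is standard. The following is a minimal sketch of Okapi BM25 scoring, where one file's tokens act as the query and the remaining small files are the candidate documents; files scoring above a chosen threshold would be grouped into the same merged file. The function name, token representation, and parameter defaults (`k1=1.5`, `b=0.75`) are illustrative assumptions, not the thesis's exact implementation.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    query_tokens: list of terms from one small file.
    docs: list of tokenized small files (lists of terms).
    Returns one relevance score per document.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * num / den
        scores.append(s)
    return scores
```

Files whose pairwise scores exceed a threshold would then be written into one merged HDFS file, with per-file offsets recorded in the accompanying index.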
Throughout the system, a cache built with in-memory database technology (Memcached) handles index mapping and the prefetching of correlated files, improving small-file retrieval efficiency.

This thesis also describes a systematic test of the improved small-file processing scheme. Using network data as the experimental data set, three test cases are designed, and write time, read time, RAM consumption, and storage performance are analyzed. The results show that the proposed solution effectively improves the storage and read efficiency of small files and, to a certain extent, improves the storage performance of Hadoop.
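The index-prefetching idea described above can be sketched as follows: when one small file's index entry is requested, the entries for every correlated file in the same merged bundle are loaded into the cache, so subsequent requests skip the index lookup entirely. For a self-contained illustration, a plain dictionary stands in for Memcached, and the class, key layout, and `(offset, length)` entry format are assumptions rather than the thesis's actual design.

```python
class PrefetchingIndexCache:
    """Cache of small-file index entries with whole-bundle prefetching."""

    def __init__(self, bundle_index):
        # bundle_index: {bundle_id: {filename: (offset, length)}}
        # maps each merged file to the entries of the small files inside it.
        self.bundle_index = bundle_index
        self.cache = {}        # stands in for Memcached
        self.backend_hits = 0  # counts trips to the on-disk index

    def lookup(self, bundle_id, filename):
        key = (bundle_id, filename)
        if key not in self.cache:
            self.backend_hits += 1  # one index-file read for the whole bundle
            for name, entry in self.bundle_index[bundle_id].items():
                # Prefetch every correlated file's entry at once.
                self.cache[(bundle_id, name)] = entry
        return self.cache[key]
```

In a real deployment the `self.cache` dictionary would be replaced by Memcached get/set calls, so that all clients share one cache and the NameNode is consulted only on the first miss per bundle.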
Keywords/Search Tags: Hadoop, massive small files, Memcached, BM25 algorithm