With the arrival of the big data era, the volume of data produced in China alone each year has reached the zettabyte scale, creating growing storage and processing challenges for traditional database technologies and systems. Hadoop, an open-source distributed data processing framework developed by the Apache Software Foundation in recent years, offers reliability, scalability, high efficiency, and low cost when processing big data. HDFS, the Hadoop file system, inherits the strengths of traditional storage systems and adds design ideas aimed at massive data, making it well suited to storing and transmitting very large files. However, when Hadoop processes a mass of small files, performance bottlenecks emerge, such as high NameNode memory occupancy and low access and retrieval efficiency, which has made small-file storage an important research topic in both academia and industry.

To address these problems, this paper reduces the number of small files by classifying and merging them after measuring their relevance with the BM25 algorithm, thereby alleviating the NameNode memory bottleneck caused by storing small-file metadata. An index mechanism is established over the merged files, enabling access both to a merged file as a whole and to any single file within it. In addition, prefetching mechanisms for correlated small files and for their index entries are provided according to file correlation, which reduces the metadata-request load on the NameNode and speeds up the handling of subsequent requests.
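The relevance measure above can be illustrated with a minimal sketch of Okapi BM25 scoring. The function name, tokenized inputs, and default constants (`k1`, `b`) are illustrative assumptions, not the paper's implementation; in the proposed scheme, files whose pairwise scores exceed a threshold would be grouped into the same merged block.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    query_terms / doc_terms: lists of tokens; corpus: list of token lists.
    k1 and b are the standard BM25 tuning constants.
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n   # average document length
    tf = Counter(doc_terms)                   # term frequencies in this document
    score = 0.0
    for term in query_terms:
        # document frequency: how many documents in the corpus contain the term
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

Treating one small file's tokens as the query and another's as the document gives a symmetric enough correlation signal for grouping; a file sharing no terms with the query scores zero and stays in a different group.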
Throughout the system, a cache is built with the memory database technology Memcached to hold the file-mapping index and to prefetch correlated files, improving the retrieval efficiency of small files. A systematic test of the improved small-file processing scheme is presented: with network data as the experimental data, three test cases are designed, and their write and read times, RAM consumption, and storage performance are analyzed. The tests show that the solution proposed in this paper effectively improves the storage and reading efficiency of small files and, to a certain extent, improves the storage performance of Hadoop.
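The cache-and-prefetch idea can be sketched as follows. This is a hedged illustration only: a plain dict stands in for the Memcached client, and the class name, index layout `(block, offset, length)`, and `related` map are assumptions rather than the paper's actual data structures. On a cache miss, the entry is fetched from the persisted index and the index entries of correlated files are prefetched alongside it.

```python
class IndexCache:
    """Map a small-file name to its (merged_block, offset, length) entry,
    prefetching the index entries of correlated files on a miss.

    A plain dict stands in for Memcached here; a real deployment would
    issue get/set calls against a Memcached server instead.
    """

    def __init__(self, index_store, related):
        self.store = index_store   # persisted index: name -> (block, offset, length)
        self.related = related     # correlation map: name -> list of related names
        self.cache = {}            # stand-in for the Memcached layer

    def lookup(self, name):
        entry = self.cache.get(name)
        if entry is None:
            # miss: load from the persisted index, then warm the cache
            entry = self.store[name]
            self.cache[name] = entry
            # prefetch entries for files merged alongside this one,
            # so correlated requests hit the cache directly
            for other in self.related.get(name, []):
                self.cache.setdefault(other, self.store[other])
        return entry
```

Because correlated files were merged into the same block, a single prefetch pass turns the likely next requests into cache hits, avoiding repeated metadata round trips to the NameNode.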