
Research Of Improving Storage Of Replica And Small Files Merging And Access Optimization On Hadoop Platform

Posted on: 2016-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Li
Full Text: PDF
GTID: 2348330476455745
Subject: Software engineering

Abstract/Summary:
In recent years, "big data" technology has gradually become a hot topic in both academia and industry. As a development platform for big-data processing, Hadoop not only makes it inexpensive to process large volumes of data, but is also open source. HDFS, the distributed file system at the lowest level of Hadoop, stores all the data on the storage nodes of the cluster; it offers high fault tolerance and high throughput, and provides efficient read and write performance for MapReduce. However, HDFS stores multiple replicas through a serial pipeline, which restricts replica-storage performance. Meanwhile, with the continuous development of Internet technology, massive numbers of small files are accumulating rapidly; because Hadoop adheres to a design philosophy oriented toward the storage of large files, its access performance on massive small files is severely restricted. In response, this thesis carries out in-depth research on these two issues. The main research content and innovations are as follows:

Firstly, addressing the storage inefficiency in HDFS caused by the serial storage of replicas, and building on the parallel storage methods proposed by other researchers, this thesis puts forward a new design. Guided by this optimization idea, the thesis analyzes the storage infrastructure of HDFS and the structure of the related classes and data blocks in depth and in detail, and improves upon them. It realizes parallel storage of replicas by having the client create socket connections to every DataNode in the pipeline.

Secondly, addressing the severe restriction that massive small files place on Hadoop's I/O performance, this thesis puts forward a small-file reading scheme based on a B+Tree index, built on top of a merging proposal that uses Hadoop's built-in SequenceFile. The scheme improves the query rate of small files while reducing the share of NameNode memory occupied by small-file metadata, thereby improving small-file storage efficiency. To realize the scheme, we first design the B+Tree index structure; we then analyze and implement the construction and search functions of the B+Tree index in detail. Finally, combining this with an analysis of the HDFS file-reading process, we implement a small-file reading procedure over SequenceFile based on the B+Tree index.

Finally, we set up a Hadoop cluster and validate through simulation experiments that the two design schemes are effective in improving the storage rate of replicas and the read rate of small files, respectively.
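The first scheme replaces the serial replica pipeline with concurrent transfers from the client to every DataNode. The Python sketch below is only an illustration of that contrast, not Hadoop code: `DataNode`, `store_serial`, and `store_parallel` are hypothetical stand-ins, and the network transfer is simulated with a fixed per-copy delay.

```python
# Illustrative sketch (not HDFS source): contrast serial pipelined replica
# storage with parallel storage that pushes a block to every "DataNode"
# at once. All names here are hypothetical stand-ins for the real internals.
import time
from concurrent.futures import ThreadPoolExecutor

TRANSFER_DELAY = 0.05  # simulated network cost of transferring one block copy

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = []

    def receive(self, block):
        time.sleep(TRANSFER_DELAY)  # pretend to stream the block bytes
        self.blocks.append(block)

def store_serial(block, pipeline):
    """HDFS-style serial pipeline: the copies complete one hop at a time."""
    for node in pipeline:
        node.receive(block)

def store_parallel(block, pipeline):
    """Proposed scheme: open a connection to every replica node and
    transfer the block to all of them concurrently."""
    with ThreadPoolExecutor(max_workers=len(pipeline)) as pool:
        for fut in [pool.submit(node.receive, block) for node in pipeline]:
            fut.result()

if __name__ == "__main__":
    serial_nodes = [DataNode(f"dn{i}") for i in range(3)]
    parallel_nodes = [DataNode(f"dn{i}") for i in range(3)]

    t0 = time.perf_counter()
    store_serial(b"blk_0001", serial_nodes)
    serial_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    store_parallel(b"blk_0001", parallel_nodes)
    parallel_t = time.perf_counter() - t0

    # With 3 replicas, the parallel time approaches a single transfer delay
    # instead of three delays in sequence.
    print(f"serial: {serial_t:.3f}s, parallel: {parallel_t:.3f}s")
```

With three replicas the serial path pays three transfer delays back to back, while the parallel path pays roughly one, which is the performance gap the thesis targets.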
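The second scheme merges small files into a single container and resolves reads through an index instead of per-file NameNode metadata. The sketch below is a simplified, in-memory illustration of that access pattern: it uses a sorted list with binary search in place of the thesis's B+Tree over a SequenceFile, and the function and variable names are hypothetical.

```python
# Illustrative sketch: merge many small files into one container and look
# entries up through a sorted (name, offset, length) index, binary-searched
# the way a B+Tree leaf lookup would be. The thesis builds a real B+Tree
# over a Hadoop SequenceFile; this in-memory version only shows the idea.
import bisect
import io

def merge_small_files(files):
    """files: dict of name -> bytes. Returns (container bytes, index).
    Index entries are (name, offset, length), kept sorted by name so a
    lookup costs O(log n) instead of a scan over the whole container."""
    container = io.BytesIO()
    index = []
    for name in sorted(files):
        data = files[name]
        index.append((name, container.tell(), len(data)))
        container.write(data)
    return container.getvalue(), index

def read_small_file(container, index, name):
    """Binary-search the index by file name, then slice out the payload."""
    keys = [entry[0] for entry in index]
    i = bisect.bisect_left(keys, name)
    if i == len(index) or index[i][0] != name:
        raise FileNotFoundError(name)
    _, offset, length = index[i]
    return container[offset:offset + length]

if __name__ == "__main__":
    files = {"a.log": b"alpha", "c.log": b"gamma", "b.log": b"beta"}
    container, index = merge_small_files(files)
    print(read_small_file(container, index, "b.log"))  # b'beta'
```

Because the NameNode then tracks one container file rather than thousands of small files, the metadata footprint in NameNode memory shrinks, while each read still reaches its payload in logarithmic time through the index.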
Keywords/Search Tags:Hadoop, HDFS, Serial Storage, Parallel Storage, B+Tree, SequenceFile