
Research Of Improving Storage Of Replica And Small Files Merging And Access Optimization On Hadoop Platform

Posted on: 2016-07-09
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Li
Full Text: PDF
GTID: 2348330476455745
Subject: Software engineering

Abstract/Summary:
In recent years, "big data" technology has gradually become a hot topic in both academia and industry. As a development platform for big-data processing, Hadoop not only makes it inexpensive to process large volumes of data, but is also open source. HDFS, the distributed file system at the lowest level of Hadoop, stores all the data on the storage nodes of the cluster; it offers high fault tolerance and high throughput, and provides efficient read and write performance for MapReduce. However, HDFS stores multiple replicas through a serial pipeline, which restricts replica-storage performance. Meanwhile, with the continuous development of Internet technology, massive numbers of small files are accumulating rapidly; because Hadoop adheres to a design philosophy oriented toward the storage of large files, its access performance on massive small files is severely restricted. In response, this thesis carries out in-depth research on these two issues. The main research content and innovations are as follows:

Firstly, addressing the storage inefficiency in HDFS caused by the serial storage of replicas, and building on the parallel storage methods proposed by other researchers, this thesis puts forward a new design. Guided by this optimization idea, the thesis analyzes the storage infrastructure of HDFS and the structure of the related classes and data blocks in depth and in detail, and improves upon them. It realizes parallel storage of replicas by having the client create socket connections to every DataNode in the pipeline.

Secondly, addressing the severe restriction that massive small files place on Hadoop's I/O performance, this thesis puts forward a small-file reading scheme based on a B+Tree index, built on top of a merging proposal that uses Hadoop's built-in SequenceFile. The scheme improves the query rate of small files while reducing the share of NameNode memory occupied by small-file metadata, thereby improving small-file storage efficiency. To realize the scheme, we first design the B+Tree index structure; we then analyze and implement the construction and search functions of the B+Tree index in detail. Finally, combining this with an analysis of the HDFS file-reading process, we implement a small-file reading procedure over SequenceFile based on the B+Tree index.

Finally, we set up a Hadoop cluster and validate through simulation experiments that the two design schemes are effective in improving the storage rate of replicas and the read rate of small files, respectively.
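The first scheme replaces the serial replica pipeline with concurrent transfers from the client to every DataNode. The Python sketch below is only an illustration of that contrast, not Hadoop code: `DataNode`, `store_serial`, and `store_parallel` are hypothetical stand-ins, and the network transfer is simulated with a fixed per-copy delay.

```python
# Illustrative sketch (not HDFS source): contrast serial pipelined replica
# storage with parallel storage that pushes a block to every "DataNode"
# at once. All names here are hypothetical stand-ins for the real internals.
import time
from concurrent.futures import ThreadPoolExecutor

TRANSFER_DELAY = 0.05  # simulated network cost of transferring one block copy

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = []

    def receive(self, block):
        time.sleep(TRANSFER_DELAY)  # pretend to stream the block bytes
        self.blocks.append(block)

def store_serial(block, pipeline):
    """HDFS-style serial pipeline: the copies complete one hop at a time."""
    for node in pipeline:
        node.receive(block)

def store_parallel(block, pipeline):
    """Proposed scheme: open a connection to every replica node and
    transfer the block to all of them concurrently."""
    with ThreadPoolExecutor(max_workers=len(pipeline)) as pool:
        for fut in [pool.submit(node.receive, block) for node in pipeline]:
            fut.result()

if __name__ == "__main__":
    serial_nodes = [DataNode(f"dn{i}") for i in range(3)]
    parallel_nodes = [DataNode(f"dn{i}") for i in range(3)]

    t0 = time.perf_counter()
    store_serial(b"blk_0001", serial_nodes)
    serial_t = time.perf_counter() - t0

    t0 = time.perf_counter()
    store_parallel(b"blk_0001", parallel_nodes)
    parallel_t = time.perf_counter() - t0

    # With 3 replicas, the parallel time approaches a single transfer delay
    # instead of three delays in sequence.
    print(f"serial: {serial_t:.3f}s, parallel: {parallel_t:.3f}s")
```

With three replicas the serial path pays three transfer delays back to back, while the parallel path pays roughly one, which is the performance gap the thesis targets.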
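The second scheme merges small files into a single container and resolves reads through an index instead of per-file NameNode metadata. The sketch below is a simplified, in-memory illustration of that access pattern: it uses a sorted list with binary search in place of the thesis's B+Tree over a SequenceFile, and the function and variable names are hypothetical.

```python
# Illustrative sketch: merge many small files into one container and look
# entries up through a sorted (name, offset, length) index, binary-searched
# the way a B+Tree leaf lookup would be. The thesis builds a real B+Tree
# over a Hadoop SequenceFile; this in-memory version only shows the idea.
import bisect
import io

def merge_small_files(files):
    """files: dict of name -> bytes. Returns (container bytes, index).
    Index entries are (name, offset, length), kept sorted by name so a
    lookup costs O(log n) instead of a scan over the whole container."""
    container = io.BytesIO()
    index = []
    for name in sorted(files):
        data = files[name]
        index.append((name, container.tell(), len(data)))
        container.write(data)
    return container.getvalue(), index

def read_small_file(container, index, name):
    """Binary-search the index by file name, then slice out the payload."""
    keys = [entry[0] for entry in index]
    i = bisect.bisect_left(keys, name)
    if i == len(index) or index[i][0] != name:
        raise FileNotFoundError(name)
    _, offset, length = index[i]
    return container[offset:offset + length]

if __name__ == "__main__":
    files = {"a.log": b"alpha", "c.log": b"gamma", "b.log": b"beta"}
    container, index = merge_small_files(files)
    print(read_small_file(container, index, "b.log"))  # b'beta'
```

Because the NameNode then tracks one container file rather than thousands of small files, the metadata footprint in NameNode memory shrinks, while each read still reaches its payload in logarithmic time through the index.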
Keywords/Search Tags:Hadoop, HDFS, Serial Storage, Parallel Storage, B+Tree, SequenceFile