
Reading And Writing Strategy Research Of Massive Small Files Based On HDFS

Posted on: 2017-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: Z B Gao
Full Text: PDF
GTID: 2308330488451992
Subject: Communication and Information System
Abstract/Summary:
Nowadays, with the rapid development of Internet information and cloud computing technology, Internet data is increasingly produced by users rather than by site administrators, so people can generate and access huge amounts of data through Internet services anytime and anywhere. How to manage this massive body of personal and public data effectively has become a top priority. Traditional storage architectures perform poorly for current Internet data storage services, and their disadvantages, such as poor scalability, low security, high maintenance costs, and weak disaster recovery, are becoming increasingly apparent. Distributed cloud storage platforms, which keep personal data in a cloud service under centralized management, have received extensive attention in the IT industry: users no longer need to hold large amounts of local storage and can easily access their data in cloud storage from any smart device. A distributed storage architecture can solve the above problems and meet the demands of large-scale concurrent access.

Hadoop is an open-source distributed system designed to run on inexpensive hardware, and one of its core components, HDFS, is a form of cloud storage that handles the storage of explosively growing data well. For the scenario of reading and writing massive numbers of small files, this thesis analyzes HDFS in detail and proposes an improved system, called RCHDFS, built on a Redis cluster, to address existing problems of HDFS such as the small file problem, node selection, and access caching.

Firstly, this thesis studies several typical distributed file systems, including GFS, MooseFS, and HDFS, by analyzing their basic system composition and working principles.
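The small file problem mentioned above arises because the HDFS NameNode keeps one in-memory object per file, directory, and block regardless of file size. The sketch below estimates that burden; the ~150-byte-per-object figure is a commonly cited approximation, not a value measured in this thesis, and the function name is illustrative.

```python
# Rough estimate of NameNode heap consumed by file-system metadata.
# Assumption: each file and each block costs about one 150-byte
# in-memory object on the NameNode (a widely quoted approximation).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Heap cost: one object per file plus one per block of that file."""
    return (num_files + num_files * blocks_per_file) * BYTES_PER_OBJECT

# 100 million small files (one block each) versus the same data
# packed into 100 thousand large files (one block each):
small = namenode_heap_bytes(100_000_000)   # ~27.9 GiB of NameNode heap
large = namenode_heap_bytes(100_000)       # ~0.03 GiB of NameNode heap
print(small // large)  # the small-file layout costs 1000x more metadata
```

The point of the estimate is that metadata cost scales with the *number* of objects, not the volume of data, which is why offloading metadata to a distributed Redis cluster relieves the NameNode.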
The main components and dependencies of HDFS are examined, and the working mechanisms and corresponding source code of the NameNode, DataNode, and DFSClient are parsed in detail.

Secondly, after surveying the Chinese and foreign literature and analyzing the existing solutions to the inherent problems of HDFS, an improved scheme is put forward in three parts. The first part is a Redis cluster service, deployed on the DataNodes, that takes over most of the NameNode's metadata management tasks; this distributes memory usage evenly across the DataNodes and alleviates the NameNode's memory consumption and concurrency pressure. The second part is a selection of optimal storage nodes based on server load and the balanced distribution of data blocks, which optimizes the rack-awareness strategy of HDFS; it both preserves block balance and reduces read/write latency. The third part is an access method for small and medium-sized files based on hybrid caching, in which hot small files are cached in Redis and file metadata is cached on the client, further improving access efficiency at large scale.

Finally, a comparison experiment between native HDFS and the proposed RCHDFS is designed and implemented. The results show that the proposed approach significantly reduces the memory consumption of metadata when storing files at large scale and effectively lowers read/write latency under highly concurrent operations. In addition, it keeps the distribution of blocks and metadata balanced.
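The hybrid caching read path of the third part can be sketched as follows. This is a minimal illustration, not the thesis's implementation: plain dictionaries stand in for the Redis cluster and the HDFS store, and the class name, promotion threshold, and "hotness" counter are all assumptions.

```python
# Illustrative sketch of a hybrid read path: hot small files are served
# from a Redis-like cache, with HDFS as the backing store. The threshold
# and naming here are assumptions for illustration only.
HOT_THRESHOLD = 3  # promote a file to the cache after this many reads

class HybridReader:
    def __init__(self, hdfs_store):
        self.hdfs = hdfs_store  # stands in for the HDFS cluster
        self.cache = {}         # stands in for the Redis cluster
        self.hits = {}          # per-file access counter ("hotness")

    def read(self, path):
        if path in self.cache:
            # Cache hit: no NameNode round trip, no DataNode read.
            return self.cache[path]
        data = self.hdfs[path]  # cache miss: read through to HDFS
        self.hits[path] = self.hits.get(path, 0) + 1
        if self.hits[path] >= HOT_THRESHOLD:
            self.cache[path] = data  # file is now "hot": keep it in Redis
        return data

reader = HybridReader({"/logs/a.txt": b"payload"})
for _ in range(4):
    reader.read("/logs/a.txt")
print("/logs/a.txt" in reader.cache)  # True: promoted after repeated reads
```

Serving repeated reads of hot small files from memory avoids the per-file NameNode lookup, which is where the concurrency pressure described above originates.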
Keywords/Search Tags: Cloud storage, HDFS, Redis Cluster, Small File Problem