
Research On Storage Strategy Of Distributed File System HDFS

Posted on: 2016-07-22    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Zhou    Full Text: PDF
GTID: 2308330473453590    Subject: Computer application technology
Abstract/Summary:
The rapid development and application of Internet technology has brought explosive growth in data volume, making large-scale data storage and processing a hot topic in the field of high-performance computing. Traditional data processing models were designed for compute-intensive jobs. The emergence of distributed storage provides new solutions for mass data storage: with its pay-per-use model, it offers users low-cost, highly reliable, high-performance online data storage and access services. Therefore, how to store and access data efficiently while ensuring data availability and reliability has become a particularly critical issue for distributed storage systems.

Existing replica factor decision algorithms adjust the replication factor dynamically based mainly on accesses to the entire file. However, users are often interested in only some of the data blocks within a file, so adjusting the replication factor at the granularity of the whole file may reduce the utilization of cluster storage resources and increase the overhead of maintaining replica consistency. Meanwhile, in some application scenarios, such as video-on-demand, HDFS does not optimize reads of the hot data that users access frequently: repeated accesses to the same data cause frequent disk I/O operations on DataNode nodes and increase data access latency, while the repeated transmission of data wastes a great deal of cluster network bandwidth. In response to these problems, this thesis focuses on the data replica management strategy and the data access process of the HDFS distributed file system. The main work is as follows:

1. On the basis of an in-depth analysis of existing replica management techniques for distributed storage systems, a strategy is proposed for dynamically adjusting the replication factor based on hot data blocks. The strategy operates on individual data blocks rather than on entire files. First, exploiting the temporal locality of data access, different weights are assigned to each historical access cycle in order to predict a block's access frequency in the next cycle. Then, based on the observation that HDFS data access approximately follows the 80/20 rule, a hot-block decision threshold is determined, by which it is decided whether a single data block is hot; the block's replication factor is then adjusted dynamically according to this decision. Finally, experiments are carried out to verify the effectiveness of the strategy.

2. After an analysis of the HDFS file reading process, and aiming at the hot data that is frequently accessed in HDFS, a DataNode local two-level cache strategy based on hot data blocks is presented. This strategy sets up a two-level cache mechanism on the DataNode node based on local memory and local disk, which are used respectively to cache frequently accessed hot small files and large files. The strategy improves data access efficiency to some extent, reduces the disk I/O load of the DataNode, and saves cluster network bandwidth. Finally, experiments are carried out to verify the effectiveness of the strategy.
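To make the block-level hot-data decision in contribution 1 concrete, the following is a minimal Python sketch of the general idea described above: per-block access counts from past cycles are combined with weights that favor recent cycles (temporal locality), a hot threshold is taken from the 80/20 observation, and the replication factor is raised for blocks above it. The function names, decay weight, hot fraction, and replica bounds are illustrative assumptions, not the thesis's actual parameters.

```python
# Illustrative sketch only: block-level hot-data decision and replica factor adjustment.
# All names and numeric parameters below are assumptions for demonstration.

def predict_next_cycle(history, decay=0.5):
    """Predict next-cycle access count as a weighted average of historical cycles.

    `history` lists per-cycle access counts, oldest first; recent cycles get
    exponentially larger weights to model temporal locality.
    """
    weights = [decay ** (len(history) - 1 - i) for i in range(len(history))]
    return sum(w * c for w, c in zip(weights, history)) / sum(weights)

def hot_threshold(predictions, hot_fraction=0.2):
    """Choose a threshold so roughly the top 20% of blocks (80/20 rule) count as hot."""
    ranked = sorted(predictions, reverse=True)
    cutoff_index = max(0, int(len(ranked) * hot_fraction) - 1)
    return ranked[cutoff_index]

def decide_replica_factor(predicted, threshold, base=3, max_replicas=6):
    """Raise the replication factor for hot blocks; keep the HDFS default otherwise."""
    return max_replicas if predicted >= threshold else base

# Example: access counts for three blocks over four historical cycles.
block_history = {
    "blk_001": [5, 8, 20, 40],   # rising access, likely hot
    "blk_002": [30, 10, 4, 2],   # cooling down
    "blk_003": [1, 0, 2, 1],     # cold
}
predictions = {b: predict_next_cycle(h) for b, h in block_history.items()}
threshold = hot_threshold(list(predictions.values()))
for blk, p in predictions.items():
    print(blk, round(p, 1), decide_replica_factor(p, threshold))
```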
Keywords/Search Tags: Distributed storage, HDFS, hot data, replica factor control, local cache