
Research And Optimization Of Distributed Storage Based On HDFS

Posted on: 2018-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: M M Zhou
GTID: 2348330542492614
Subject: Computer system architecture
Abstract/Summary:
The Hadoop Distributed File System (HDFS) offers high scalability and strong fault tolerance, and it can be deployed on inexpensive hardware while still providing strong data-processing capability. This thesis analyzes the system architecture of HDFS in depth, points out its design defects, and proposes schemes to improve its handling of massive numbers of small files and its replica placement.

HDFS is designed for large files, so it has shortcomings when processing massive numbers of small files. Storing many small files on HDFS can consume a large amount of NameNode memory. When users access small files frequently, the client must repeatedly contact the NameNode and switch between DataNodes, which results in low data access efficiency. To solve this storage problem, this thesis proposes a storage optimization strategy for small files. The strategy classifies associated small files using the vector space model (VSM): during classification, documents are segmented and their characteristic words are extracted. The associated files are then merged into large files and uploaded to the HDFS cluster. A metadata cache and an index are established for the merged large files. This strategy reduces NameNode memory consumption and improves the speed of reading small files.

HDFS does not take the heterogeneity of DataNodes into account. Because nodes differ in performance, storing an equal amount of data on every node may lead to an unbalanced load distribution. To solve this problem, this thesis proposes a replica placement strategy based on node evaluation. The strategy provides an interface that allows cluster users to customize each node's load information and the corresponding weights. It optimizes the TOPSIS algorithm and uses the optimized algorithm to evaluate the nodes. Finally, it considers both the evaluation result and the network distance, and selects the best node to store the data block. In this way it balances the load across nodes and improves overall system performance.

In experiments, the two improved strategies are compared with the existing schemes. The results show that the proposed schemes improve the overall performance of HDFS.
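
To make the node-evaluation step concrete, the following is a minimal sketch of the standard TOPSIS ranking applied to candidate DataNodes. It is not the optimized variant developed in the thesis, which the abstract does not specify. The node names, the three load criteria, their values, and the weights are all hypothetical, and the final combination of the score with network distance is left out because the abstract does not say how it is done.

```python
import math

def topsis(matrix, weights, benefit):
    """Score alternatives with standard (unoptimized) TOPSIS.

    matrix:  one row per candidate node, one column per criterion
    weights: one weight per criterion
    benefit: True if larger is better for that criterion, else False
    Returns one closeness score in [0, 1] per row; higher is better.
    """
    n_cols = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_cols)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_cols)]
         for row in matrix]
    # Ideal and anti-ideal points, taken per criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)   # distance to the ideal point
        d_neg = math.dist(row, worst)   # distance to the anti-ideal point
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.5)
    return scores

# Hypothetical load information: [free disk (GB), CPU utilization, memory utilization]
nodes = {
    "dn1": [500, 0.80, 0.70],
    "dn2": [900, 0.30, 0.40],
    "dn3": [200, 0.50, 0.60],
}
weights = [0.5, 0.3, 0.2]          # assumed user-defined weights
benefit = [True, False, False]     # more free disk is better; lower utilization is better

names = list(nodes)
scores = dict(zip(names, topsis([nodes[n] for n in names], weights, benefit)))
best = max(scores, key=scores.get)
print(best, scores)
```

In this made-up data set, dn2 is best on every criterion, so it coincides with the ideal point and is ranked first. A real placement policy would also need to respect HDFS rack-awareness constraints, which this sketch ignores.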
Keywords/Search Tags: HDFS, Small File Storage, Replication Placement