With the development of modern agriculture and information technology, the scale of agricultural data keeps growing, its types are becoming more diverse, and its structure more complex. Among these data, small files account for a considerable proportion, and the big-data management platform HDFS (Hadoop Distributed File System) cannot access this kind of data efficiently. Based on a fully distributed cluster environment, and combining the structural design of HDFS with the characteristics of agricultural data, this paper studies storage and retrieval respectively and proposes an access scheme containing multi-level small-file processing units based on an EHDFS (Extensible HDFS) architecture. The scheme includes an optimized merge-storage unit based on file relevance and block space, a two-level file index unit based on the MapFile strategy, and a file pre-caching unit based on the LFU (Least Frequently Used) page replacement strategy. The main research contents are as follows:

(1) Optimized merge storage based on file relevance and block space. To solve the DataNode space loss and NameNode memory pressure caused by storing massive numbers of agricultural small files in HDFS, this paper constructs a small-file optimized merge-storage model. First, a threshold method is used to judge and partition the test files. Then, the Simhash algorithm detects the relevance of agricultural small files, so that highly relevant files are stored close together, reducing the resource cost of accessing associated files. Finally, by considering how the size distribution of small files affects the space occupancy rate of data blocks, a space-optimized merge-storage model is constructed that fills blocks as fully and evenly as possible with small files, maximizing node space utilization.

(2) Two-level file index based on the MapFile strategy. To solve the frequent information interaction and massive resource consumption of the HDFS index mechanism when retrieving small files, this paper constructs a two-level file index model based on the MapFile strategy. By optimizing the file-mapping structure and improving the index elements and the index storage location, the model reduces the high latency and extra resource consumption caused by decentralized indexing and cross-node retrieval of target files.

(3) File pre-caching based on the LFU page replacement strategy. To solve the frequent I/O operations and node load between the client and the cluster nodes when retrieving hot data from large-scale datasets, this paper analyzes the impact of caching on file retrieval efficiency and constructs a pre-caching model based on the LFU page replacement strategy. The scheme relies on the EditLog, FSImage, and metadata to obtain the index of hot data in advance, and uses the LFU strategy to replace and order cached data items, improving the cache hit rate and retrieval timeliness of small files.

(4) Establishment of the EHDFS access-model system architecture, experiments, and result analysis. First, the system architecture of the small-file access scheme based on the EHDFS architecture is established. Then, the efficiency of the EHDFS access model is verified by building a fully distributed cluster environment and designing related experiments. For the group access test scenario of agricultural small files, the results show that, compared with HDFS and Hadoop Archive, EHDFS reduces file write time by 89.36% and 8.39% respectively, and reduces NameNode memory consumption by 97.28% and 47.62% respectively; compared with HDFS and MapFile, EHDFS reduces file read time by 89.03% and 18.79% respectively.
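The relevance detection in contribution (1) relies on Simhash fingerprints: near-duplicate or related content yields fingerprints that differ in only a few bits. A minimal sketch follows; the whitespace tokenizer, MD5 token hashing, 64-bit fingerprint width, and Hamming-distance threshold of 3 are illustrative assumptions, not the thesis's exact parameters:

```python
import hashlib

def simhash(text, bits=64):
    """Simhash fingerprint: signed vote per bit position, weighted by token hashes."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def related(text_a, text_b, threshold=3):
    """Treat two files as related when their fingerprints differ in few bits."""
    return hamming(simhash(text_a), simhash(text_b)) <= threshold
```

Files judged related by this test can then be directed to the same merged file so associated data is read together.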
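The space-optimization step in contribution (1) amounts to packing variably sized small files densely into fixed-size HDFS blocks. The thesis's exact model is not reproduced here; a first-fit-decreasing bin-packing sketch conveys the idea (the 128 MB default matches HDFS's default block size, everything else is an assumption):

```python
def pack_files(file_sizes, block_size=128 * 1024 * 1024):
    """First-fit-decreasing packing of small files into fixed-size blocks.

    Returns a list of blocks, each a list of (file_index, size) pairs, such
    that each block's total stays within block_size and blocks fill densely.
    """
    order = sorted(enumerate(file_sizes), key=lambda p: p[1], reverse=True)
    blocks, free = [], []  # free[i] = remaining capacity of blocks[i]
    for idx, size in order:
        for i, cap in enumerate(free):
            if size <= cap:  # first block with room
                blocks[i].append((idx, size))
                free[i] -= size
                break
        else:  # no existing block fits: open a new one
            blocks.append([(idx, size)])
            free.append(block_size - size)
    return blocks
```

Placing the largest files first leaves small remainders that later files can fill, which keeps blocks near capacity and reduces the per-block metadata held by the NameNode.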
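Contribution (2)'s lookup path, from a small-file name to its merged file and then to a byte range inside it, can be sketched with two in-memory maps. The class and field names below are hypothetical, and the thesis builds on MapFile's sorted data/index file pair rather than plain hash maps; this only illustrates the two-level resolution:

```python
class TwoLevelIndex:
    """Sketch of a two-level index: file name -> merged file -> (offset, length)."""

    def __init__(self):
        self.file_to_merged = {}   # level 1: small-file name -> merged-file id
        self.merged_offsets = {}   # level 2: merged id -> {name: (offset, length)}

    def add(self, merged_id, name, offset, length):
        self.file_to_merged[name] = merged_id
        self.merged_offsets.setdefault(merged_id, {})[name] = (offset, length)

    def locate(self, name):
        """Resolve a small file to the merged file and byte range holding it."""
        merged_id = self.file_to_merged[name]
        offset, length = self.merged_offsets[merged_id][name]
        return merged_id, offset, length
```

One lookup identifies the merged container, a second yields the exact byte range, so the client issues a single ranged read instead of scanning scattered per-file indexes.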
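The replacement policy in contribution (3) is LFU: when the cache is full, evict the entry with the lowest access count so frequently retrieved hot files stay resident. A minimal sketch, assuming dictionary-based storage and ties broken by earliest insertion:

```python
class LFUCache:
    """Least-Frequently-Used cache: evict the entry with the lowest access count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}   # key -> cached value
        self.freq = {}   # key -> access count

    def get(self, key):
        if key not in self.data:
            return None
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # Evict the least-frequently-used key (earliest inserted on ties).
            victim = min(self.freq, key=self.freq.get)
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] = self.freq.get(key, 0) + 1
```

Pre-loading such a cache with hot-data indexes gathered from the EditLog and FSImage lets repeat requests hit memory instead of triggering cluster I/O.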