Research On Optimization Method Of File Access Based On Hadoop

Posted on: 2021-02-05
Degree: Master
Type: Thesis
Country: China
Candidate: W J Sun
GTID: 2428330605955967
Subject: Engineering
Abstract/Summary:
With the rise of big data technology, the data generated by users' access to the Internet is growing exponentially, and most of this data takes the form of small files. Traditional storage technology performs poorly when processing large numbers of small files, and the Hadoop distributed architecture provides a good solution for processing large amounts of data. Hadoop itself shows high performance in processing large files, but as the number of small files increases, the metadata held in NameNode memory takes up excessive space and file access performance drops. This thesis therefore designs an efficient scheme for optimizing small file storage in Hadoop, which is its main research direction.

To address the low storage efficiency of small files in Hadoop, this thesis designs an associated file merging algorithm and a file caching algorithm. In the associated file merging algorithm, an associated file mining model is first designed in the small file preprocessing module: a K-nearest neighbor algorithm based on TF-IDF feature extraction and a weighted cosine similarity measure. Stemming, word segmentation and other preprocessing operations are applied to determine the categories of the training files and the K neighbors of the files to be classified; experiments then identify, among the training file sets, the K neighbors most similar to the file to be classified, and clustering is carried out; finally, the classified file set is obtained and merged. In the merging algorithm, small files are stored as <key,value> pairs: the file path name and the file content are read, merged, and uploaded to HDFS storage. The file caching algorithm is an LRU-K cache eviction algorithm that addresses the shortcomings of the LRU and LFU algorithms. Its key point is to record a timestamp and an access frequency for each small file that users access and to evict files that are no longer used, which improves cache performance for frequently used files and further increases the hit rate when users read files.

To verify the feasibility of the algorithms, multiple experiments were conducted on a Hadoop cluster built for this purpose. The scheme was compared with the original HDFS file storage scheme and the HAR archive scheme, and performance was measured in terms of NameNode memory occupancy and the time spent writing files. The experimental results show that the optimized small file access scheme designed in this thesis effectively reduces NameNode memory consumption in HDFS and the time users spend reading and writing small files, which further supports the feasibility of the proposed scheme.
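The abstract does not give the details of the LRU-K eviction rule, so the following Python sketch only illustrates the general LRU-K idea: each cached file keeps its last K access timestamps, and the file whose K-th most recent access is oldest is evicted first. The class name `LRUKCache`, its methods, the use of a logical counter as the timestamp, and the choice to discard a file's history on eviction are all assumptions made for this example, not the thesis implementation.

```python
from collections import deque
from itertools import count


class LRUKCache:
    """Toy LRU-K cache (hypothetical sketch, not the thesis code)."""

    def __init__(self, capacity, k=2):
        if capacity < 1 or k < 1:
            raise ValueError("capacity and k must be positive")
        self.capacity = capacity
        self.k = k
        self._data = {}         # file name -> cached content
        self._history = {}      # file name -> last k access ticks
        self._clock = count(1)  # logical timestamps (assumption)

    def _touch(self, name):
        # Record one access; the deque keeps only the k most recent ticks.
        hist = self._history.setdefault(name, deque(maxlen=self.k))
        hist.append(next(self._clock))

    def _victim(self):
        def priority(name):
            hist = self._history[name]
            # Files seen fewer than k times sort first (k-th tick = 0);
            # ties are broken by the least recent access.
            kth = hist[0] if len(hist) == self.k else 0
            return (kth, hist[-1])

        return min(self._data, key=priority)

    def get(self, name):
        """Return cached content, or None on a miss."""
        if name not in self._data:
            return None
        self._touch(name)
        return self._data[name]

    def put(self, name, content):
        """Insert or update a file, evicting one entry if the cache is full."""
        if name not in self._data and len(self._data) >= self.capacity:
            victim = self._victim()
            del self._data[victim]
            del self._history[victim]  # simplification: history is dropped
        self._data[name] = content
        self._touch(name)
```

With `capacity=2` and `k=2`, putting `a.txt`, reading it once, then putting `b.txt` and `c.txt` evicts `b.txt` rather than `a.txt`: `a.txt` has two recorded accesses while `b.txt` has only one. Plain LRU would have evicted `a.txt` here, since its last access is older, which is the kind of difference that combining timestamps with access frequency is meant to capture.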
Keywords/Search Tags: Hadoop, nearest neighbor recommendation algorithm, association file merge, LRU-K