Research On HDFS Small File Access Method Based On Frequent Item Set

Posted on: 2020-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 2428330572481091
Subject: Engineering
Abstract/Summary:
The Hadoop Distributed File System (HDFS) is widely used for file storage. With advances in network technology and growing user demand, large volumes of small files are being generated. HDFS handles large numbers of small files poorly in two respects: storing them consumes a large amount of NameNode memory, and frequent communication between the client and the NameNode degrades the NameNode's I/O performance. How to use HDFS to store and manage large numbers of small files efficiently, while keeping search and access fast and accurate, is therefore an urgent problem and an important research direction. This thesis proposes a new solution and evaluates its feasibility for storing massive numbers of files in HDFS.

To address the problem, this thesis designs an associated file merging algorithm and a file caching and updating algorithm. The associated file merging algorithm improves the Apriori algorithm: it converts the original data set into a transaction matrix and adds a transaction count and an item count to compress the matrix further. This reduces the number of times the Apriori algorithm traverses the transaction database, lowers the I/O overhead, and improves the execution efficiency of the algorithm. Merging strongly associated files together prepares for shorter access times during the file read phase. A merging algorithm for small files is then designed, which alleviates the internal fragmentation of data blocks and the uneven distribution of file volume that occur with HAR archiving.

The file caching and updating algorithm reduces communication between the client and the NameNode by adding a file caching strategy. Files that are likely to be accessed next are loaded into a buffer according to a predicted file sequence. If the file requested by the user is already in the buffer, it is returned directly, which reduces the number of communications between the client and the NameNode and improves file reading efficiency. In addition, a method based on a long short-term memory (LSTM) network model is designed to update the file sequence and improve the accuracy of the predicted sequence.

Experiments show that, during the file storage phase, the proposed solution improves the utilization of DataNode data blocks and reduces the memory consumption of the NameNode. During the file reading phase, it reduces the number of communications between the client and the NameNode, improves file reading efficiency, and shortens user access time. These results demonstrate the feasibility and effectiveness of the small file access optimization scheme proposed in this thesis.
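The matrix-based Apriori idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the sample access log, and the reading of "transaction count" (how many identical transactions one matrix row stands for) and "item count" (how many files a row contains, used to skip rows too short for a k-itemset) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_compressed_matrix(transactions):
    """Turn transactions into distinct 0/1 rows with two extra counters per row."""
    items = sorted({f for t in transactions for f in t})
    col = {f: j for j, f in enumerate(items)}
    merged = Counter()
    for t in transactions:
        row = [0] * len(items)
        for f in t:
            row[col[f]] = 1
        merged[tuple(row)] += 1  # identical transactions collapse into one row
    rows = list(merged)
    trans_counts = [merged[r] for r in rows]  # assumed meaning of "transaction count"
    item_counts = [sum(r) for r in rows]      # assumed meaning of "item count"
    return items, rows, trans_counts, item_counts


def frequent_itemsets(transactions, min_support):
    """Apriori over the in-memory matrix; the original log is read only once."""
    items, rows, trans_counts, item_counts = build_compressed_matrix(transactions)

    def support(cols):
        k = len(cols)
        # Rows with fewer than k items cannot contain a k-itemset, so skip them.
        return sum(
            n for r, n, size in zip(rows, trans_counts, item_counts)
            if size >= k and all(r[j] for j in cols)
        )

    result = {}
    level = [(j,) for j in range(len(items)) if support((j,)) >= min_support]
    while level:
        for cols in level:
            result[tuple(items[j] for j in cols)] = support(cols)
        known = set(level)
        candidates = set()
        # Join step: combine itemsets that share all but their last column.
        for a, b in combinations(level, 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step: every subset one item smaller must be frequent.
                if all(s in known for s in combinations(cand, len(a))):
                    candidates.add(cand)
        level = sorted(c for c in candidates if support(c) >= min_support)
    return result


# Hypothetical access log: each set is the group of files read in one session.
access_log = [
    {"a.txt", "b.txt"},
    {"a.txt", "b.txt"},
    {"a.txt", "b.txt", "c.txt"},
    {"c.txt"},
    {"a.txt", "c.txt"},
]
frequent = frequent_itemsets(access_log, min_support=3)
```

In this sketch, files that appear together in a frequent itemset (here `a.txt` and `b.txt`) would be the candidates for being merged into the same block, so that a read of one can serve the other.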
Keywords/Search Tags: HDFS, Apriori algorithm, Small file merging, LSTM