Research On HDFS Small File Access Method Based On Frequent Item Set

Posted on:2020-04-02

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

Full Text:PDF

GTID:2428330572481091

Subject:Engineering

Abstract/Summary:

The Hadoop Distributed File System (HDFS) is widely used for file storage. With advances in network technology and growing user demand, large amounts of small-file data are being generated. HDFS currently has two main shortcomings when storing large numbers of small files: storing them consumes a large amount of NameNode memory, and frequent communication between the client and the NameNode degrades the NameNode's I/O performance. How to use HDFS to store and manage large numbers of small files efficiently, and to search and access them quickly and accurately, is therefore an urgent problem and an important research direction. This paper proposes a new solution and evaluates its feasibility for HDFS storage of massive numbers of small files.

To address the problem of massive file storage in HDFS, this paper designs an associated-file merging algorithm and a file caching and updating algorithm. The associated-file merging algorithm improves on the Apriori algorithm: it converts the original data set into a transaction matrix and adds a transaction count and an item count so that the matrix can be compressed further. This reduces the number of times Apriori traverses the transaction database, lowers the I/O overhead, and improves the execution efficiency of the algorithm. Merging strongly associated files prepares for shorter request access times during the file read phase. A merging algorithm for small files is then designed, which alleviates the internal fragmentation of data blocks and the uneven file volume distribution found in HAR archive technology.

The file caching and updating algorithm aims to reduce communication between the client and the NameNode by adding a file caching strategy. Files that are likely to be accessed next are loaded into a buffer according to a predicted file sequence. If the file requested by the user is in the buffer, it is returned to the user directly, which reduces the number of communications between the client and the NameNode and improves file reading efficiency. A method based on the long short-term memory (LSTM) network model is designed to update the file sequence and improve the accuracy of the predicted sequence.

Experiments show that the proposed solution improves the utilization of DataNode data blocks and reduces the memory consumption of the NameNode during the file storage phase. During the file read phase, it reduces the number of communications between the client and the NameNode, improves file reading efficiency, and shortens the user's access time. This demonstrates the feasibility and effectiveness of the small file access optimization scheme proposed in this paper.
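
The transaction-matrix idea in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual algorithm, so everything here is an assumption made for illustration: each transaction is taken to be the set of small files accessed together in one session, the "item count" is read as a per-column count used to drop infrequent files, and the "transaction count" is read as a per-row count used to merge identical rows. The function names `build_compressed_matrix`, `support`, and `frequent_itemsets` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_compressed_matrix(transactions, min_support):
    """Build a boolean transaction matrix and compress it (illustrative only).

    Assumed reading of the abstract: columns whose item count is below
    min_support are dropped, and identical rows are merged into a single
    row that carries a transaction count.
    """
    item_count = Counter(f for t in transactions for f in set(t))
    items = sorted(f for f, c in item_count.items() if c >= min_support)
    row_count = Counter()
    for t in transactions:
        row = tuple(1 if f in t else 0 for f in items)
        if any(row):  # rows with no frequent file carry no information
            row_count[row] += 1
    rows = list(row_count)
    counts = [row_count[r] for r in rows]
    return items, rows, counts


def support(cols, rows, counts):
    """Sum the transaction counts of rows that contain every column in cols."""
    return sum(c for r, c in zip(rows, counts) if all(r[j] for j in cols))


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search that scans only the compressed matrix."""
    items, rows, counts = build_compressed_matrix(transactions, min_support)
    # Every surviving column is already a frequent 1-itemset.
    current = {(j,): support((j,), rows, counts) for j in range(len(items))}
    result = {}
    k = 1
    while current:
        for cols, sup in current.items():
            result[frozenset(items[j] for j in cols)] = sup
        # Join step: combine k-itemsets that share their first k-1 columns.
        candidates = set()
        for a, b in combinations(sorted(current), 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step: every k-subset of the candidate must be frequent.
                if all(sub in current for sub in combinations(cand, k)):
                    candidates.add(cand)
        current = {}
        for cand in candidates:
            sup = support(cand, rows, counts)
            if sup >= min_support:
                current[cand] = sup
        k += 1
    return result
```

Calling `frequent_itemsets(sessions, min_support)` on a list of per-session file sets returns a dictionary that maps each frequent file set to its support. File sets with high support are natural candidates for being merged into the same block, in the spirit of the associated-file merging step. Because the original transactions are read only once to build the matrix, every later support count is computed over the smaller in-memory structure.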

Keywords/Search Tags:

HDFS, Apriori algorithm, Small file merging, LSTM

Related items

1	The Research Of HDFS Optimization Towards Lots Of Small Files Accessing And Storage
2	Research On Small File Access Technology Based On Hadoop
3	Research And Application Of The Optimization Strategy Of File Storage And Reading Based On HDFS
4	Research And Implementation Of Small File Storage Model Based On HDFS
5	Optimization Study On Storing Massive Small Files Based On Hadoop
6	Research And Implementation Of Mass Small File Based On HDFS
7	Optimization And Implementation Of Small File Storage In HDFS Under Hadoop Platform
8	The Optimization Technology And Application Of Massive Small File Access Based On HDFS
9	Improvement Of HDFS Small File Storage Based On Har
10	Research On Mass Small File Storage Technology Based On Hadoop