Research On Hot File Identification And Application Technology Based On Natural Language Processing

Posted on: 2020-04-26
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
GTID: 2518306548994409
Subject: Computer Science and Technology
Abstract/Summary:
Tiered storage is an important technology in the field of computer storage. It stores data in multiple tiers and uses file classification and data migration to mask latency and increase throughput. The main purpose of hierarchical storage management is to classify files correctly at runtime. In a multi-tiered storage system, this task can be split into several binary classification subtasks. However, the correctness of file classification depends heavily on understanding I/O behavior. Our research focuses on three aspects. First, this paper proposes a vectorized representation of files. With this method, we traverse the tree-like directory of the file system to generate a corpus and use a word embedding model to map files into high-dimensional vectors, forming a new kind of metadata that provides a quantitative basis for modeling and analyzing I/O behavior. Second, based on this vectorized representation, a hot and cold file classification model based on a recurrent neural network is established. To verify the validity of the model, this paper designs a file classification experiment that uses application compilation as the target workload. The experimental results show that, for single-process compilation tasks, after parameter tuning and classification threshold selection, a high hot-file recognition rate can be achieved while the false positive rate on cold data is kept low. Third, to apply the models in real file systems, this paper designs a data migration framework based on the GlusterFS file system. The framework improves the design of the Tiering module, the CTR module, and the GFDB database: it stores file vectors as metadata in the GFDB database and embeds the recurrent neural network-based file classification model into the CTR module as a supplement to the LRU cache algorithm, providing data migration strategies for hierarchical file systems.
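The corpus-generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a file is turned into text, so the choice here (one "sentence" per file, made of its path components relative to the root) is an assumption, and the names `path_tokens` and `build_corpus` are hypothetical. The resulting token lists are the kind of input a word embedding model would typically be trained on.

```python
import os


def path_tokens(root, path):
    """Split a file path, relative to root, into its components.

    Assumption: each path component (directory or file name) is one token.
    """
    rel = os.path.relpath(path, root)
    return rel.split(os.sep)


def build_corpus(root):
    """Traverse the directory tree under root and return one token list
    ("sentence") per regular file found.

    Directories and files are visited in sorted order so that the corpus
    is deterministic across runs.
    """
    corpus = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends
        for name in sorted(filenames):
            corpus.append(path_tokens(root, os.path.join(dirpath, name)))
    return corpus


if __name__ == "__main__":
    # Print the sentences for the current directory as a quick check.
    for sentence in build_corpus("."):
        print(" ".join(sentence))
```

Using path components as tokens means that files sharing directories share context words, which is one plausible way for an embedding model to place related files close together. Other tokenizations, such as splitting file names on extensions, are equally compatible with the description in the abstract.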
Keywords/Search Tags: Tiered Storage, Data Migration, Access Pattern, Word Embedding, RNN