Analysis And Optimization Of Hot And Cold Data Based On Machine Learning From Access Behavior

Posted on: 2021-12-03 | Degree: Master | Type: Thesis
Country: China | Candidate: C Y Yi | Full Text: PDF
GTID: 2518306104987819 | Subject: Computer system architecture
Abstract/Summary:
In order to offer cost-effective large-scale storage, cloud service providers usually use hybrid storage nodes that combine high-speed Solid State Drives (SSDs) as caches with Hard Disk Drives (HDDs) as back-end storage. However, SSDs wear out and have higher write latency than read latency. Minimizing the amount of data written to SSDs while exploiting their read performance has therefore become a key issue. To understand this problem, we collected trace records from multiple storage nodes in Alibaba's Pangu storage system. Analyzing these traces revealed three characteristics: (1) some nodes have a high proportion of read requests; (2) much of the data is accessed only once over a long period; (3) request behavior is correlated to some degree with its attributes. Accurately identifying whether a file is currently hot or cold, and filtering out read accesses to cold data, can therefore reduce the amount of SSD writes and improve the cache hit rate. At the same time, prefetching write data that is predicted to be hot can further improve cache performance. Based on these ideas, we present a machine-learning-based cache mechanism that classifies hot and cold data, taking chunks as the objects of classification and distinguishing them according to their prior access history. First, the size and access time of each file block are extracted from the trace records. An iterative method updates the historical reuse distance of file blocks, retaining as much information as possible at constant complexity. We then classify this information with Decision Tree (DT) and seven other traditional machine learning algorithms, and with Convolutional Neural Networks (CNN) and two other deep learning algorithms. XGBoost (eXtreme Gradient Boosting) is found to be the best ensemble learning method. Finally, the optimal classification threshold is determined from the size of the cache space and the average size of file blocks. Experimental results show that the hot and cold data classification cache mechanism can effectively improve the cache hit rate and reduce SSD writes. After applying this strategy to the LRU cache algorithm, the cache hit rate is increased by 4.84% on average, and the SSD write rate is reduced to 21.69% on average.
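The mechanism summarized above can be illustrated with a minimal sketch. It is not the thesis implementation: the exponential smoothing used for the constant-cost reuse-distance update, the names `ReuseTracker`, `FilteredLRU` and `replay`, the `alpha` weight, and the fixed `threshold` rule are all assumptions chosen for illustration, and the threshold rule merely stands in for the trained classifier (XGBoost in the thesis). Reuse distance is measured here as the number of requests between two accesses to the same chunk.

```python
from collections import OrderedDict


class ReuseTracker:
    """Keeps a constant-size reuse-distance summary per chunk (assumed design)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # smoothing weight, a hypothetical parameter
        self.last_seen = {}     # chunk id -> logical time of previous access
        self.avg_reuse = {}     # chunk id -> smoothed reuse distance

    def record(self, chunk, now):
        """Update the summary in O(1); return it, or None on first access."""
        prev = self.last_seen.get(chunk)
        self.last_seen[chunk] = now
        if prev is None:
            return None
        distance = now - prev
        old = self.avg_reuse.get(chunk)
        if old is None:
            new = distance
        else:
            # Fold the new distance into the running summary.
            new = self.alpha * distance + (1 - self.alpha) * old
        self.avg_reuse[chunk] = new
        return new


class FilteredLRU:
    """LRU cache that admits a missed chunk only if it is predicted hot."""

    def __init__(self, capacity, is_hot):
        self.capacity = capacity
        self.is_hot = is_hot    # callable: chunk id -> bool
        self.entries = OrderedDict()
        self.hits = 0
        self.ssd_writes = 0

    def access(self, chunk):
        if chunk in self.entries:
            self.entries.move_to_end(chunk)  # refresh recency on a hit
            self.hits += 1
            return True
        if self.is_hot(chunk):
            self.entries[chunk] = True       # admission costs one SSD write
            self.ssd_writes += 1
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return False


def replay(trace, capacity, threshold):
    """Replay a list of chunk ids; return (hits, ssd_writes)."""
    tracker = ReuseTracker()
    summary = {}

    def is_hot(chunk):
        # Stand-in for the classifier: hot if the smoothed reuse distance is small.
        value = summary.get(chunk)
        return value is not None and value <= threshold

    cache = FilteredLRU(capacity, is_hot)
    for now, chunk in enumerate(trace):
        summary[chunk] = tracker.record(chunk, now)
        cache.access(chunk)
    return cache.hits, cache.ssd_writes
```

For example, `replay(["a", "b", "a", "c", "a", "d", "a"], capacity=2, threshold=3)` admits only the repeatedly accessed chunk `a`, so the one-off chunks `b`, `c` and `d` never cause an SSD write. Passing `lambda chunk: True` as `is_hot` to `FilteredLRU` gives plain LRU for comparison.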
Keywords/Search Tags:Access Behavior Analysis, Machine Learning, Deep Learning, Data Heat Prediction