
The Research On Key Technologies Of SSD-based Storage System

Posted on: 2020-12-12    Degree: Doctor    Type: Dissertation
Country: China    Candidate: E C Xu    Full Text: PDF
GTID: 1368330611492979    Subject: Computer Science and Technology
Abstract/Summary:
Recently, reports suggest that around 60% of big-data-related investment is focused on storage systems and their related infrastructure. As a fundamental component of modern storage systems, the SSD (Solid State Drive) is particularly favored for its high performance and low power draw. However, compared to traditional spinning disks, SSDs and the storage systems built upon them may not meet users' critical demands for performance and reliability. In particular, users are concerned with three questions: (1) What are the root causes of SSD-related failures in storage systems? (2) How do the internal components interact with each other and ultimately contribute to failures? (3) How can the performance of SSDs be exploited for new frameworks and applications?

Answering these questions confronts researchers and developers with great challenges. Regarding reliability, flash drives introduce new and unique errors (e.g., NAND erase errors), so the internal firmware and the upper-level software stack must be adapted to these reliability challenges. Moreover, as the scale of the system grows rapidly, so does the difficulty of understanding its reliability. Furthermore, developers also face challenges in optimizing SSD performance under newly developed frameworks and applications (e.g., big data frameworks and deep learning applications): due to their unique data communication and computation characteristics, traditional strategies and scheduling may not work well for SSD-based storage systems. Therefore, this dissertation targets two important aspects of SSD-based storage systems, reliability and high performance, and makes four major contributions.

First, by collecting a large amount of data covering the whole life cycle of SSDs, we identify the main reasons why the accuracy of SSD fault prediction on a single node is low. We point out that existing cyclic measurement tools cannot accurately reflect the actual endurance of an SSD, and that an automated test and monitoring framework is needed to monitor and analyze the entire process from a drive's initial state to its end of life. Based on this whole-life-cycle monitoring, many previously unreported error modes are discovered. Building on this work, we design iLife, a model-based SSD lifetime prediction system. Driven by actual SSD usage statistics, iLife can quickly and accurately identify problems across the whole life cycle of an SSD, and it outperforms existing lifetime prediction systems based on write wear (i.e., P/E cycles).
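The abstract does not spell out iLife's internal model, so the following is only a minimal sketch of the idea it describes: predicting remaining SSD life from attributes monitored over the whole life cycle rather than extrapolating linearly from P/E cycles. The linear model, the attribute set, and all function names below are assumptions made purely for illustration.

```python
# Minimal sketch (not iLife itself): fit a model on whole-life-cycle monitoring
# data and compare it with the naive P/E-cycle extrapolation the abstract
# argues against. Attribute choice and the linear model are assumptions.
import numpy as np

def fit_life_model(history: np.ndarray) -> np.ndarray:
    """history: shape (n_samples, n_features + 1); each row holds a drive's
    monitored attributes (e.g., wear, erase errors, reallocations) plus the
    observed remaining-life fraction (1.0 = new, 0.0 = end of life)."""
    X = np.hstack([history[:, :-1], np.ones((len(history), 1))])  # add bias term
    y = history[:, -1]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)                  # least squares fit
    return coef

def predict_remaining_life(coef: np.ndarray, attrs: np.ndarray) -> float:
    """Predict the remaining-life fraction of a drive from its current attributes."""
    return float(np.clip(np.append(attrs, 1.0) @ coef, 0.0, 1.0))

def pe_cycle_baseline(pe_used: float, pe_rated: float) -> float:
    """Baseline the dissertation compares against: assume life is consumed
    linearly with program/erase cycles."""
    return max(0.0, 1.0 - pe_used / pe_rated)
```

In this framing, any gain over the baseline comes from the additional monitored signals rather than from the particular regression model chosen here.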
Second, a multi-node SSD storage system has higher complexity and more uncertainty than a single-node one. Using seven Alibaba Cloud data centers as the background, we study a multi-node cloud storage system covering 450,000 block-storage drives and three years of historical data totaling more than 100,000 records. We organize the research into the following three questions. Question 1: How many faults in the storage systems are related to SSDs, and what are their characteristics? This dissertation concludes that 7.8% of the faults are caused by SSDs, falling into node startup failures, file system unavailability, disk loss, cache errors, and media errors. Question 2: If not all failures in SSD storage systems are caused by the SSD itself, what are the other causes? This dissertation concludes that approximately 34.4% of SSD-related errors are not related to the SSD itself; an unstable connection can also cause an SSD to fail, and UCRC errors can be used as a criterion for judging whether device-layer errors stem from unstable connections. Question 3: Besides hardware errors caused by the SSD itself, do other aspects of the system indirectly contribute to failures? This dissertation concludes that SSD failures and errors are affected by cloud services: block storage services can cause serious SSD imbalance problems, and the current placement of SSDs within and across nodes is not optimal, resulting in three types of thermal anomalies that can lead to up to 58% raw bit errors. Different error-correction methods and active traversal methods are needed to reduce the errors caused by passive drive heating.

Third, memory-based distributed big data platforms such as Apache Spark are the current mainstream big data processing frameworks. In Spark, RDDs support only coarse-grained caching, and there are strict restrictions on the types of data that can be cached. These limitations result in low memory utilization and in large amounts of data being written to high-latency back-end storage, which does not suit the needs of different workloads. Programmers even have to make cache placement and timing decisions manually, so the caching cannot adapt to dynamic changes in the running program. We propose a caching strategy called Neutrino, which gives users a fully automated, fine-grained cache allocation strategy for different workflows. In the implementation, the data flow of the running program is first obtained; the optimal cache strategy is then derived from this data flow via dynamic programming; finally, RDDs are cached at fine granularity and deployed according to the plan. We implemented a prototype in Spark; tested in various experimental environments and on four different workflows, Neutrino outperforms the traditional Spark caching strategy.

Lastly, distributed deep learning is a hot application in current big data storage and processing systems, yet deploying deep learning applications on big data clusters is difficult. To obtain the best-performing configuration, one often has to tune a large number of options manually, and because deep learning has its own configuration peculiarities, simply applying similar existing tools does not directly solve the problem. To address the abstraction of performance bottlenecks and the diverse resource requirements and allocation problems, we design and implement Dike, a prototype system for automatic configuration of distributed deep learning. Dike captures dynamic information about deep learning tasks, covering model details and cluster configuration parameters, and casts the resource configuration problem as a knapsack problem. This dissertation designs a reasonable value-judgment function to determine the final configuration and node deployment plan. Experimental results show that Dike achieves about 95% of the best strategy with almost no labor cost.
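Since the abstract states only that Dike casts resource configuration as a knapsack problem guided by a value-judgment function, the sketch below illustrates that framing in its simplest 0/1 form. The candidate placements, their GPU costs, and the estimated values are invented for illustration; Dike's real option space and value function are not given in the abstract.

```python
# Hypothetical sketch of the knapsack framing attributed to Dike: choose a set
# of candidate resource allocations whose total cost fits the cluster budget
# and whose estimated value is maximal. Item names, costs, and values are
# illustrative assumptions, not Dike's actual options or value function.
from typing import List, Tuple

def knapsack_config(options: List[Tuple[str, int, float]], budget: int):
    """options: (name, resource_cost, estimated_value); budget: total resources.
    Returns (best_total_value, chosen_option_names) via 0/1 knapsack DP."""
    n = len(options)
    dp = [0.0] * (budget + 1)                      # dp[c] = best value at capacity c
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i, (_, cost, value) in enumerate(options):
        for c in range(budget, cost - 1, -1):      # iterate downward for 0/1 semantics
            if dp[c - cost] + value > dp[c]:
                dp[c] = dp[c - cost] + value
                keep[i][c] = True
    chosen, c = [], budget                         # trace back the chosen options
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            chosen.append(options[i][0])
            c -= options[i][1]
    return dp[budget], list(reversed(chosen))

# Example: decide which worker groups to launch under a 16-GPU budget.
candidates = [("worker-group-A (4 GPUs)", 4, 3.1),
              ("worker-group-B (8 GPUs)", 8, 5.6),
              ("worker-group-C (16 GPUs)", 16, 8.0)]
print(knapsack_config(candidates, budget=16))      # -> picks groups A and B
```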
Keywords/Search Tags:Flash, SSD, Storage System, Reliability, High Performance, Big Data, Deep Learning