I/O Resource Monitoring And Diagnosis System For The Sunway TaihuLight

Posted on: 2019-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: B Yang
GTID: 2428330545453682
Subject: Computer technology
Abstract/Summary:
This paper presents an effort to overcome the complexity of production-use I/O performance monitoring and the difficulty of I/O interference analysis and optimization on the current No. 1 supercomputer, Sunway TaihuLight. We design, implement, and deploy an end-to-end I/O resource monitoring and diagnostic system named Beacon for Sunway TaihuLight, which simultaneously collects and correlates I/O traces and profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata nodes. Mechanisms including aggressive online trace compression are proposed to facilitate scalable, low-overhead, and sustainable I/O diagnostics under production use. Higher-level per-application I/O behaviors are reconstructed from system-level I/O performance data, giving more insight into the correlations between system performance bottlenecks, utilization symptoms, and application behaviors than isolated diagnostics based on black-box lower-level logs.

Examinations enabled by Beacon during its several months of deployment so far have led to optimizations, implemented or planned, that improve application performance, enhance system resource utilization, and reduce inter-workload I/O interference. For example, we identified that major I/O performance interference does not necessarily come from applications with high I/O throughput, but may come from applications that issue inefficient I/O requests, incurring high contention and low utilization at the same time. We propose several Beacon-assisted optimizations, including avoiding N-1 mode in applications, request priority adjustment, I/O-aware application isolation, extra I/O forwarding node allocation, abnormal storage node removal, and grouped I/O, and demonstrate their effectiveness through large-scale application and benchmark evaluations.

The major contributions of this paper are as follows:

• We have designed, implemented, and deployed a lightweight end-to-end I/O resource monitoring and diagnostic system, Beacon, for the current No. 1 supercomputer. Beacon collects and correlates I/O-related performance data from compute nodes, I/O forwarding nodes, storage nodes, and metadata nodes in real time.

• We have devised mechanisms such as aggressive online trace data analysis and compression to facilitate scalable and sustainable I/O monitoring under production use. Consequently, we obtain per-application-level information that complements black-box lower-level logs, thereby connecting system performance bottlenecks and utilization symptoms to application behaviors. In practice, this helps application developers and system administrators identify and locate I/O performance issues on Sunway TaihuLight.

• Through our deployment of Beacon, we have observed that major I/O performance interference does not necessarily come from applications with high I/O throughput (as assumed by previous studies), but may come from applications that issue inefficient I/O requests, incurring high contention and low utilization at the same time.

• Based on Beacon-provided insights, we propose practical optimizations including avoiding N-1 mode, request priority adjustment, I/O-aware application isolation, extra I/O forwarding node allocation, abnormal storage node removal, and grouped I/O, and demonstrate their effectiveness with real-world applications.
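
The idea of reconstructing per-application I/O behavior from system-level data can be illustrated with a minimal sketch. This is not Beacon's implementation: the record layouts, field names, and sample values below are hypothetical, and the sketch only assumes that node-level I/O samples can be matched against job allocation records by node and time.

```python
from collections import defaultdict
from typing import NamedTuple


class IoSample(NamedTuple):
    """Hypothetical node-level I/O record, as a system-level monitor might log it."""
    node: str
    timestamp: int    # seconds since some epoch
    bytes_moved: int


class JobAllocation(NamedTuple):
    """Hypothetical scheduler record: which nodes a job held, and when."""
    job_id: str
    nodes: frozenset
    start: int
    end: int          # exclusive


def per_job_io(samples, allocations):
    """Attribute each node-level sample to the job that held that node at that time."""
    totals = defaultdict(int)
    for sample in samples:
        for alloc in allocations:
            if sample.node in alloc.nodes and alloc.start <= sample.timestamp < alloc.end:
                totals[alloc.job_id] += sample.bytes_moved
                break  # assume a node belongs to at most one job at a time
    return dict(totals)


allocations = [
    JobAllocation("job-A", frozenset({"cn0", "cn1"}), 0, 100),
    JobAllocation("job-B", frozenset({"cn2"}), 50, 150),
]
samples = [
    IoSample("cn0", 10, 4096),
    IoSample("cn1", 20, 8192),
    IoSample("cn2", 60, 1024),
    IoSample("cn2", 200, 512),  # outside any allocation, so left unattributed
]

result = per_job_io(samples, allocations)
print(result)
```

In this toy form, the per-job totals turn anonymous node-level logs into application-level figures, which is the kind of link between system symptoms and application behavior that the abstract describes.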
Keywords/Search Tags: I/O monitoring, optimization, interference, Sunway TaihuLight, deployed system