Research On Key Issues In Large Scale Clustered File System Lustre

Posted on:2012-10-18

Degree:Doctor

Type:Dissertation

Country:China

Candidate:Y J Qian

Full Text:PDF

GTID:1118330341451759

Subject:Computer Science and Technology

Abstract/Summary:

The cluster architecture has matured into the mainstream architecture for high-performance computing (HPC). Clustered file systems are a key technology for easing the I/O bottleneck of HPC clusters. As HPC technologies continue to develop, the storage demands of HPC applications keep increasing. Lustre is the leading clustered file system and has become the standard for building HPC storage systems, with the largest market share in HPC. Lustre scales effectively to support systems with tens of thousands of compute nodes and has proven aggregate I/O performance and scalability. As HPC systems add nodes to increase overall performance, future HPC clusters will become extremely large. This poses serious challenges for Lustre, especially in scalability, I/O performance, and availability. This thesis focuses on these problems. Its main contributions are as follows.

(1) Based on the parallel I/O access characteristics of large-scale applications, this thesis presents a novel server-side network request scheduler framework for large-scale Lustre storage clusters. Building on this framework, it proposes an Object Based Round Robin (OBRR) scheduling algorithm that reorders the execution of I/O requests so that the workload presented to the backend storage can be optimized more easily. To avoid starvation and meet the response time requirements of I/O requests with different urgencies, it also proposes a novel two-level deadline setting strategy consisting of a dynamic deadline and a mandatory deadline.
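
To make the scheduling idea concrete, here is a minimal Python sketch of object-based round robin with a starvation guard. It is an illustration under stated assumptions, not the thesis's implementation: the `Request` and `OBRRScheduler` names, the per-object `quantum`, and the offset ordering within an object are hypothetical, and the two-level deadline is reduced to a single mandatory deadline because the abstract does not say how the dynamic deadline is computed.

```python
import heapq
import itertools
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(eq=False)
class Request:
    object_id: int
    offset: int
    deadline: float  # assumed mandatory deadline (absolute time)


class OBRRScheduler:
    """Hypothetical object-based round robin with a deadline guard."""

    def __init__(self, quantum=2):
        self.quantum = quantum          # max requests served per object per turn
        self.queues = OrderedDict()     # object_id -> heap of (offset, seq, Request)
        self._seq = itertools.count()   # tie-breaker so Requests are never compared

    def add(self, req):
        heap = self.queues.setdefault(req.object_id, [])
        heapq.heappush(heap, (req.offset, next(self._seq), req))

    def _remove(self, req):
        heap = self.queues[req.object_id]
        heap[:] = [item for item in heap if item[2] is not req]
        heapq.heapify(heap)
        if not heap:
            del self.queues[req.object_id]

    def next_batch(self, now):
        # Starvation guard: requests past their deadline are served first,
        # earliest deadline first, regardless of which object they target.
        overdue = sorted(
            (r for heap in self.queues.values() for _, _, r in heap
             if r.deadline <= now),
            key=lambda r: r.deadline,
        )
        if overdue:
            for r in overdue:
                self._remove(r)
            return overdue
        if not self.queues:
            return []
        # Round robin: serve up to `quantum` requests of the object at the
        # head of the rotation, in ascending offset order, then rotate.
        object_id, heap = next(iter(self.queues.items()))
        batch = [heapq.heappop(heap)[2]
                 for _ in range(min(self.quantum, len(heap)))]
        if heap:
            self.queues.move_to_end(object_id)
        else:
            del self.queues[object_id]
        return batch
```

Grouping requests by object and sorting them by offset is what would give the backend storage a more sequential workload in this sketch; the deadline check bounds how long any one request can wait behind other objects' batches.
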
A series of experiments with the Lustre simulator, scaled up to thousands of nodes, demonstrates that the OBRR algorithm increases I/O performance by as much as 40%, and that the two-level deadline setting strategy maintains fairness, avoids starvation, and meets the response time requirements of I/Os with different urgencies.

(2) Much as networks suffer congestion, storage clusters suffer I/O congestion when they scale to extremely large sizes. This thesis proposes a dynamic I/O congestion control mechanism to support upcoming exascale HPC systems. When the server is lightly loaded, the mechanism allows clients to issue more concurrent I/O requests to the server, which improves the utilization of network and server resources and raises I/O throughput. When the server is overloaded, it throttles the clients' I/O and limits the number of I/O requests queued on the server, which controls I/O latency and avoids congestive collapse. The results of a series of evaluation experiments on the Tianhe-1 supercomputer demonstrate the effectiveness of this I/O congestion control mechanism: it prevents congestive collapse and, on that premise, takes a best-effort approach to maximizing the I/O throughput of the scalable Lustre file system.

(3) To address the problems that a fixed timeout mechanism causes in large-scale HPC cluster systems, this thesis proposes an adaptive, scalable RPC timeout mechanism that takes network conditions, server load, scalability, and performance into account. The mechanism comprises two strategies: an adaptive timeout strategy and an early reply strategy.
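
As a rough illustration of these two strategies, the following Python sketch shows one way a client could derive its timeout from recently observed behavior and extend a pending deadline when an early reply arrives. The class names, the sliding-window maximum, the safety margin, and the floor and ceiling values are all assumptions made for illustration; the abstract does not specify how the thesis computes timeouts.

```python
from collections import deque


class AdaptiveTimeout:
    """Hypothetical client-side estimator: timeout tracks recent RPC times."""

    def __init__(self, initial=5.0, floor=1.0, ceiling=600.0,
                 margin=1.25, history=8):
        self.initial = initial      # used until a first sample is observed
        self.floor = floor          # never time out faster than this
        self.ceiling = ceiling      # never wait longer than this
        self.margin = margin        # safety factor over the worst recent time
        self.samples = deque(maxlen=history)

    def observe(self, service_time, network_latency):
        # One sample per completed RPC: server processing plus network delay.
        self.samples.append(service_time + network_latency)

    def timeout(self):
        if not self.samples:
            return self.initial
        estimate = max(self.samples) * self.margin
        return min(self.ceiling, max(self.floor, estimate))


class PendingRpc:
    """Hypothetical in-flight RPC whose deadline the server can extend."""

    def __init__(self, sent_at, timeout):
        self.deadline = sent_at + timeout

    def on_early_reply(self, extra_time):
        # The server is alive but busy: keep waiting instead of timing out.
        self.deadline += extra_time

    def timed_out(self, now):
        return now >= self.deadline
```

In this sketch an early reply only moves the deadline; a client that receives neither a reply nor an early reply before the deadline would treat the server or network as failed.
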
In the adaptive timeout strategy, the timeout value set by clients is adjusted dynamically according to network conditions and server workload, which reduces the system-wide performance degradation caused by ineffective timeouts. To distinguish server congestion from a server or network failure, and to resolve the nested timeout problem, the thesis proposes an early reply strategy: when an RPC is about to time out, the server sends the client a lightweight early reply message telling it to wait an extra amount of time for the response. This further avoids unnecessary timeouts and improves system responsiveness. A series of simulation experiments demonstrates that, compared with the fixed timeout mechanism, the adaptive timeout strategy reduces the RPC timeout rate from 76% to 13%, and to 0% when combined with the early reply strategy. In large-scale RPC-based clusters, existing RPC failure detection mechanisms such as client-driven polling and probing generate a considerable amount of unnecessary network traffic and scale poorly, whereas the proposed mechanism generates much less extra network traffic and is a more scalable failure detection mechanism for RPC models with timeouts.

(4) This thesis studies Lustre's distributed lock manager technology.
First, it analyzes the concurrency control mechanism for file access, as well as the client-side dentry cache and data writeback cache based on lock callbacks. Second, it studies metadata operations based on intent locks, the subtree lock mechanism, and the file size acquisition algorithm based on extent locks. Finally, it proposes an adaptive I/O locking strategy, an optimized interval-tree-based conflict check strategy for extent locks, and a lock discarding strategy; these strategies further improve Lustre's I/O performance and the scalability of its lock service.

(5) This thesis studies the transactional metadata update algorithm and recovery mechanism of the stateful Lustre file system. Lustre allows the server to return the result of a metadata transaction to the client as soon as the in-memory update finishes, at which point the result is visible in the whole namespace. This provides good metadata performance, but it causes a cascading abort problem during reboot recovery (or failover), which makes transparent recovery impossible. Lustre's reboot recovery algorithm requires all clients to reconnect to the server within a special recovery time window; the clients then resend their uncommitted transactional requests, and the server replays these requests strictly in transaction number order. These recovery conditions are too strict. To improve Lustre's recoverability, this thesis proposes the version based recovery and commit on share algorithms. Together they extend Lustre's metadata update and recovery algorithms and allow clients to rejoin the cluster through recovery under more relaxed conditions. The version based recovery algorithm adds a version check during recovery, and transactions whose versions match are allowed to replay. The commit on share algorithm forces an inter-client dependent transaction to be committed to disk as soon as the dependency is detected, so that no client reads or writes the data of uncommitted transactions.
It eliminates inter-client recovery dependencies and allows clients to recover independently. Experimental evaluation demonstrates that the commit on share algorithm affects performance because of the mandatory disk commits triggered when inter-client dependencies are detected. Nevertheless, in a very large-scale Lustre cluster, the commit on share functionality is usually enabled to provide a highly reliable, highly available service.
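
The version check described in contribution (5) can be sketched in a few lines of Python. This is a simplified model under assumptions, not the thesis's algorithm: the `Update`, `Server`, and `recover_client` names are hypothetical, each object's version is assumed to be the transaction number of the last update applied to it, and an object with no recorded version is assumed to have version 0.

```python
from dataclasses import dataclass


@dataclass
class Update:
    transno: int       # transaction number assigned by the server
    object_id: str     # object the transaction modified
    pre_version: int   # object version the update was originally applied to


class Server:
    """Hypothetical server state after a restart: committed versions only."""

    def __init__(self, committed_versions):
        self.versions = dict(committed_versions)

    def replay(self, update):
        # Version check: replay only if the object is in the same state the
        # update was originally applied to.
        if self.versions.get(update.object_id, 0) != update.pre_version:
            return False
        self.versions[update.object_id] = update.transno
        return True


def recover_client(server, updates):
    """Replay one client's uncommitted updates; return the rejected ones."""
    rejected = []
    for update in sorted(updates, key=lambda u: u.transno):
        if not server.replay(update):
            rejected.append(update)
    return rejected


# Client A updated "dir" (version 10 -> 11); client B then updated the same
# "dir" (11 -> 12) and a private "file". Neither was committed before the crash.
client_a = [Update(11, "dir", 10)]
client_b = [Update(12, "dir", 11), Update(13, "file", 0)]

server = Server({"dir": 10})
rejected = recover_client(server, client_b)   # client A never reconnects
```

In this model, client B's update to its private `file` replays even though client A is absent, while its update to `dir` is rejected because it depends on A's uncommitted transaction. Under commit on share as described above, the server would have committed A's transaction to disk before B built on it, so B's `pre_version` would match the on-disk version after a restart.
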

Keywords/Search Tags:

Lustre, HPC, I/O scheduling, QoS, scalability, congestion control, failure detection, distributed lock, concurrency control, recovery, high availability

Related items

1	Improving Availability With Fine-grained Failure Detection And Recovery
2	The Research And Implementation Of Failure Detection And Recovery Of The Forces Control Elements
3	Research And Implementation On Disaster-recovery Oriented Failure Detection Algorithm
4	Research Of Distributed Database Resource's High Availability
5	An Automatic I/O Congestion Control Mechanism Based On Deep Q-learning
6	Design And Implementation Of High Availability Service Fault Recovery System For OpenStack
7	Research On Control Layer Failure Detection And Recovery Algorithm In SDN Framework
8	Research And Implementation On High Availability Of Forces Control Element
9	Research And Development On Distributed Redundancy Protocol Switch
10	Research On NVM-oriented Lustre Persistent Cache On Client