Peta-Scale I/O with the Lustre File System
The Lustre™ file system first went into production in spring 2003 on the Multiprogrammatic Capability Resource (MCR) cluster at Lawrence Livermore National Laboratory (LLNL). The MCR cluster was one of the largest clusters of its time, with 1,100 Linux compute nodes acting as Lustre clients. Since then, the Lustre file system has been deployed on larger systems, notably the Sandia Red Storm deployment, with approximately 25,000 liblustre clients, and the Oak Ridge National Laboratory (ORNL) Jaguar system, which is of similar scale and runs Lustre technology both with Compute Node Linux (CNL) and with Catamount. A number of other Lustre deployments feature many thousands of clients. The servers used in these configurations vary considerably: some clusters use a small number of fast, heavyweight servers, while others use a large number of lightweight servers.
The scale of these clusters poses serious challenges to any I/O system. This paper discusses discoveries related to cluster scaling made over the years. It describes implemented features, work currently in progress, and problems that remain unresolved. Topics covered include scalable I/O, locking policies and algorithms to cope with scale, implications for recovery, and other scalability issues.