The use of clusters of independent compute nodes as high capability
and capacity computers is rapidly growing in industry, academia, and
government. This growth is accompanied by fast-paced progress in cluster-aware
hardware, and in particular in interconnection technology. Contemporary
networks offer not only excellent performance as expressed by latency
and bandwidth, but also advanced architectural features, such as programmable
network interface cards, hardware support for collective communication
operations, and support for modern communication interfaces and
mechanisms such as MPI and remote direct memory access (RDMA).
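
To make these architectural features concrete, the following minimal
C/MPI sketch exercises two collective operations, a barrier and a
global reduction, of the kind that such interconnects can execute
largely in hardware; it uses only standard MPI calls and assumes no
particular network:

    /* collectives.c: collective operations that modern interconnects
     * can offload to hardware.
     * Compile: mpicc collectives.c -o collectives
     * Run:     mpirun -np 4 ./collectives
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Global synchronization point: hardware multicast/combine
         * support can implement this in logarithmic time or better. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Global reduction: each node contributes one value and every
         * node receives the sum; NICs with collective support can
         * combine the values inside the network fabric itself. */
        int local = rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
    }
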
The rapid progress in cluster hardware and usage is unfortunately
not matched by similar progress in system software. This software
comprises the middleware layer: the operating system, user libraries,
and utilities that mediate between the hardware and the user applications,
allowing applications to make use of the machine's resources. In fact, most
of these clusters use common workstation operating systems such as
Linux running on each of the cluster's nodes, with a collection of
loosely-related libraries, utilities, and scripts to access the cluster's
resources. Such solutions are hardly adequate for large-scale clusters
or high-performance computing applications. The problems they
cause include (but are not limited to): (1) poor performance and scalability
of applications and system software; (2) reduced utilization of the
machine due to suboptimal resource allocation; (3) reliability problems
caused by the multitude of independent software modules and the redundancy
in their operation; and (4) difficulty in operating and making full
use of these machines.
The premise behind this dissertation is that system software can be
dramatically improved in terms of performance, scalability, reliability,
and simplicity by making use of the features offered by modern interconnects.
Unlike the tasks of a single-node operating system, most of a cluster's
system-software tasks involve the efficient global synchronization of resources. As such,
parallel system software can be designed to benefit from the novel
hardware features offered by contemporary interconnection technology.
This dissertation promotes the idea of treating a cluster's operating
system like any other high-performance parallel application: increasing
its reliance on the interconnect's synchronization capabilities while
reducing its per-node complexity and redundancy.
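
As a purely illustrative sketch of this premise (not the actual
implementation described later), the following C/MPI fragment expresses
one resource-management decision, choosing the least-loaded node for an
incoming job, as a single global reduction rather than a chain of
point-to-point messages; node_load() is a hypothetical stand-in for a
real load metric:

    /* Hypothetical sketch: a global scheduling step expressed as one
     * collective operation. Each node contributes its current load; a
     * single MINLOC reduction tells every node which node should
     * receive the next job. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed placeholder for a real load metric (run-queue length,
     * memory pressure, etc.); here it returns a random value. */
    static double node_load(void) { return (double)(rand() % 100); }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        srand(rank + 1);

        struct { double load; int node; } local, global;
        local.load = node_load();
        local.node = rank;

        /* One collective replaces a gather/decide/broadcast exchange;
         * on hardware with collective support it completes in a few
         * microseconds, independent of the software on each node. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC,
                      MPI_COMM_WORLD);

        if (rank == 0)
            printf("next job goes to node %d (load %.0f)\n",
                   global.node, global.load);

        MPI_Finalize();
        return 0;
    }
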
This dissertation makes the following primary contributions. First,
it describes a set of network mechanisms necessary to support this
system-software model. It then discusses a prototype implementation of
system software based on these mechanisms. This system currently
tackles three main aspects of parallel computers: resource management,
communication libraries, and job scheduling methods. This model was
implemented on three different cluster architectures. Extensive performance
and scalability evaluations with real clusters and applications show
significant improvements over previous work in all three areas. In
particular, this research focuses on job scheduling strategies and
demonstrates that, with advanced scheduling algorithms, the system's
throughput and responsiveness can be improved over a wide spectrum of workloads.