# Los Alamos NATIONAL LABORATORY

## **Buffered Coscheduled (BCS) MPI**: A New Approach in the System Software Design for Large-Scale Parallel Computers

Juan Fernández<sup>1,2</sup>, Fabrizio Petrini<sup>1</sup> and Eitan Frachtenberg<sup>1</sup>

{juanf,fabrizio,eitanf}@lanl.gov

<sup>1</sup>Modeling, Algorithms and Informatics Group (CCS-3), Los Alamos National Laboratory

<sup>2</sup>Computer Engineering Department, University of Murcia, Spain

# Motivation

BCS-MPI introduces a new approach to designing system software for large-scale parallel machines. The goal is to reduce the complexity, non-determinism and redundancy of the main components of the system software with a minimal performance penalty.

BCS-MPI globally organizes all the system activities at a very fine granularity. Both computation and communication are scheduled at regular intervals, in a real-time fashion, and the scheduling decisions are taken after a global exchange of control information.
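As a rough illustration of this scheme, the toy Python model below buffers communication requests as they are posted and only matches and services them at time-slice boundaries, the point where a real system would first perform the global exchange of control information. It is a sketch under stated assumptions, not BCS-MPI code: the `Descriptor` and `ToyScheduler` names and their methods are hypothetical, the matching rule is deliberately simplistic, and the only value taken from the poster is the 500 µs slice length.

```python
from dataclasses import dataclass, field

TIME_SLICE_US = 500  # slice length quoted on the poster; everything else is assumed


@dataclass
class Descriptor:
    """A buffered communication request (hypothetical structure)."""
    kind: str            # "send" or "recv"
    src: int             # sending rank
    dst: int             # receiving rank
    payload: object = None


@dataclass
class ToyScheduler:
    """Toy model: requests are buffered, then matched globally once per slice."""
    posted: list = field(default_factory=list)     # posted during the current slice
    unmatched: list = field(default_factory=list)  # carried over from earlier slices
    delivered: dict = field(default_factory=dict)  # dst rank -> [(src, payload), ...]
    now_us: int = 0

    def post_send(self, src, dst, payload):
        # Posting only records the request; nothing is transferred yet.
        self.posted.append(Descriptor("send", src, dst, payload))

    def post_recv(self, dst, src):
        self.posted.append(Descriptor("recv", src, dst))

    def end_of_slice(self):
        """Stand-in for the global exchange: match sends with receives, then transfer."""
        pending = self.unmatched + self.posted
        self.posted = []
        sends = [d for d in pending if d.kind == "send"]
        recvs = [d for d in pending if d.kind == "recv"]
        leftover, matched = [], 0
        for s in sends:
            r = next((r for r in recvs if r.src == s.src and r.dst == s.dst), None)
            if r is None:
                leftover.append(s)          # no partner yet: retry next slice
            else:
                recvs.remove(r)
                self.delivered.setdefault(s.dst, []).append((s.src, s.payload))
                matched += 1
        self.unmatched = leftover + recvs
        self.now_us += TIME_SLICE_US
        return matched


sched = ToyScheduler()
sched.post_send(0, 1, "halo")       # rank 0 posts a send during the first slice
first = sched.end_of_slice()        # no matching receive yet, so nothing moves
sched.post_recv(1, 0)               # rank 1 posts the receive one slice later
second = sched.end_of_slice()       # the pair is matched and the message delivered
print(first, second, sched.delivered, sched.now_us)
```

In this model a message is never delivered at the moment it is posted, only at a slice boundary, so the order of deliveries depends on which slice a request falls into rather than on the exact instant it was issued.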

BCS-MPI is a lightweight MPI implementation that represents a trade-off between simplicity and performance. It is designed on top of a minimal set of communication primitives that are almost entirely implemented in the network interface card.

BCS-MPI has been successfully validated with several scientific codes representative of the ASCI workload.

### Goals

| <b>Goals</b>                                                            | <b>Current Status</b>                                                                         | <b>Future Work</b>            |
|-------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|-------------------------------|
| Target: large-scale parallel machines                                   | NIC-based implementation on state-of-the-art hardware (low level of intrusion)                | Improved functional debugging |
| Simplify the design of the communication library and its implementation | Integrated Monitoring and Debugging System which provides different levels of non-determinism | Job prioritization            |
| Minimize/eliminate non-determinism during the execution of MPI programs | Most existing scientific codes run efficiently with BCS-MPI (based on MPICH)                  | µKernel implementation        |
| Automatic functional and performance debugging of MPI programs          |                                                                                               | Checkpointing                 |
| Minimal performance penalty                                             |                                                                                               | Fault tolerance               |



# Design

Intuition: a SIMD communication library runs MIMD MPI programs.

Hierarchical design based on a basic set of communication/synchronization primitives.

**NIC-based OS-bypass implementation.**

**Global scheduling of computation**, communication and synchronization operations for MPI user code: Global NIC2 Heartbeat (500 µs time slices).

System activities are organized in microphases within every time slice.

Scalability is facilitated by tightly coupling the collective communication operations with the collective primitives provided by the hardware.

Integrated Monitoring and Debugging Mode, which provides a selectable level of non-determinism (in the strictest mode, the system is able to rerun an arbitrarily large parallel program in a completely deterministic way).

Integration as a plugin in a resource management system for parallel jobs.

### **Performance Evaluation**

### **Cluster Configuration**

- 32 HP rx2600 compute nodes
- 128-port Quadrics switch

### **Compute Node Configuration**

- Dual Itanium-II processor
- 2 GB of ECC RAM
- 2 133MHz/64-bit PCI-X buses
- 2 Quadrics QM-400 Elan3 NIC
- 100 Mbit Ethernet NIC

### **Software Configuration**

- Red Hat Linux 7.2
- Intel C/Fortran 7.1.17
- SWEEP3D (50x50x50)

### **Runtime Results**

- Comparison between BCS-MPI and Quadrics MPI for different numbers of processors (runtime in seconds).

