GPU Error Recovery

Driver Requirement: NVIDIA driver 580 or newer (needed for the GPU migration that recovery relies on)

Flexium provides automatic GPU error recovery, allowing training to survive and recover from common CUDA errors like OOM, ECC errors, and device asserts.

Overview

GPU errors are a common cause of failed training runs:

OOM (Out of Memory): Batch size too large or memory fragmentation
ECC Errors: Hardware memory corruption
Device Assert: Kernel-level assertion failures
Illegal Memory Access: Memory access violations
Launch Failures: CUDA kernel launch issues

With Flexium's recoverable(), your training can automatically detect these errors, migrate to a healthy GPU, and optionally retry the failed operation.

Three Ways to Use `recoverable()`

Flexium provides three patterns, from simplest to most control:

Option 1: Simple Context Manager (Recommended)

The current operation is LOST, but training continues on the new GPU.

This is the simplest approach. If a GPU error occurs, Flexium migrates to a new GPU and suppresses the exception. The code inside the with block that failed is not retried: that batch or operation is lost, and training continues with the next iteration.

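import torch
import flexium

flexium.init()

# Net, dataloader, and the loss function are placeholders for your own code
model = Net().cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

for batch, target in dataloader:
    with flexium.auto.recoverable():
        # If OOM happens here, this batch is LOST,
        # but we migrate to a new GPU and continue.
        # Clearing gradients first keeps a lost batch from leaving stale gradients behind.
        optimizer.zero_grad()
        output = model(batch.cuda())
        loss = criterion(output, target.cuda())
        loss.backward()
        optimizer.step()

    # Training continues here with the next batch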

Using explicit scope control

You can also use flexium.auto.run() instead of flexium.init():

with flexium.auto.run():
    # ... same code as above

Output when OOM occurs:

[flexium] GPU error: OOM
[flexium] WARNING: The current operation is LOST. Migrating to new GPU...
[flexium] Migrating to cuda:2...
[flexium] Migration complete. Training continues (current batch was lost).

Operation is Lost

With this pattern, the failing operation (batch) is not retried. For most deep learning training, losing one batch is acceptable. If you need to retry the exact same operation, use the decorator or iterator pattern below.

Option 2: Decorator (Replays the Operation)

The operation is RETRIED on the new GPU.

Wrap your training step in a function with the @recoverable decorator. If a GPU error occurs, Flexium migrates to a new GPU and calls the function again with the same arguments.

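import torch
import flexium

flexium.init()

@flexium.auto.recoverable(retries=3)
def train_step(model, batch, target, optimizer, criterion):
    # Clear gradients first so a retried call does not accumulate them twice
    optimizer.zero_grad()
    output = model(batch.cuda())
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
    return loss.item()

# Net, dataloader, and the loss function are placeholders for your own code
model = Net().cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

for batch, target in dataloader:
    loss = train_step(model, batch, target, optimizer, criterion)
    # If OOM happened, train_step was retried on the new GPU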

You can also use @recoverable without parentheses, in which case the default of 3 retries applies:

@flexium.auto.recoverable
def train_step(model, batch):
    ...

Output when OOM occurs:

[flexium] GPU error: OOM (attempt 1/3)
[flexium] Recovering: migrating to cuda:2...
[flexium] Recovery successful - now on cuda:2, retrying operation...
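
To illustrate how one decorator can accept both the bare form and the form with arguments, here is a minimal standalone sketch. It does not use Flexium and is not Flexium's implementation: retry_on_error and RecoverableError are hypothetical names, the migration step is reduced to a comment, and treating retries as "retries after the first call" is an assumption about the semantics.

```python
import functools


class RecoverableError(RuntimeError):
    """Hypothetical stand-in for a GPU error that would be treated as recoverable."""


def retry_on_error(func=None, *, retries=3):
    """Call the wrapped function again, with the same arguments, after a recoverable error.

    Works both as @retry_on_error and as @retry_on_error(retries=5).
    Assumption: `retries` counts re-invocations after the first call.
    """

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except RecoverableError:
                    if attempt == retries:
                        raise  # out of retries: propagate the last error
                    # A real implementation would migrate to another GPU here.

        return wrapper

    if func is not None:
        return decorate(func)  # bare form: @retry_on_error
    return decorate  # parameterized form: @retry_on_error(retries=...)


calls = {"n": 0}


@retry_on_error
def flaky_double(x):
    """Fails twice with a simulated error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RecoverableError("simulated OOM")
    return x * 2


print(flaky_double(21), calls["n"])
```

Because the function is replayed with the same arguments, any side effects it performed before failing (for example, partially accumulated gradients) are repeated unless the function resets them at the start.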

Option 3: Iterator Pattern (Advanced)

Most control over retry logic.

You write the retry loop yourself, which is useful when you need custom logic between retries.

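import torch
import flexium

flexium.init()

# Net, dataloader, and the loss function are placeholders for your own code
model = Net().cuda()
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

for batch, target in dataloader:
    for attempt in flexium.auto.recoverable(retries=3):
        with attempt:
            # Clear gradients first so a retried attempt does not accumulate them twice
            optimizer.zero_grad()
            output = model(batch.cuda())
            loss = criterion(output, target.cuda())
            loss.backward()
            optimizer.step()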
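
The "for attempt ... with attempt" shape can look unusual at first. The following standalone sketch shows one way such an iterator of context managers could work, and how custom logic (here, halving a batch size on each retry) fits between attempts. It does not use Flexium: attempts, Attempt, its number attribute, and RecoverableError are all hypothetical names, and Flexium's own attempt objects may expose different attributes.

```python
class RecoverableError(RuntimeError):
    """Hypothetical stand-in for a GPU error that would be treated as recoverable."""


class Attempt:
    """One try of the protected block; suppresses recoverable errors only."""

    def __init__(self, number):
        self.number = number  # 1-based attempt counter (an assumed attribute)
        self.succeeded = False
        self.error = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.succeeded = True
            return False
        if issubclass(exc_type, RecoverableError):
            self.error = exc
            return True  # suppress, so the for loop can run another attempt
        return False  # any other exception propagates immediately


def attempts(retries=3):
    """Yield attempts until one succeeds; re-raise the last error when exhausted."""
    last_error = None
    for number in range(1, retries + 2):  # first try plus `retries` retries
        attempt = Attempt(number)
        yield attempt
        if attempt.succeeded:
            return
        last_error = attempt.error
    if last_error is not None:
        raise last_error


# Custom logic between retries: shrink the batch on every new attempt.
batch_size = 64
log = []
for attempt in attempts(retries=3):
    with attempt:
        size = batch_size // (2 ** (attempt.number - 1))
        log.append(size)
        if size > 16:
            raise RecoverableError("simulated OOM")
        result = size

print(log, result)
```

The loop stops as soon as one attempt completes without an error, so the body never runs more often than needed.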

Supported Errors

Error Type	Description	Recovery Action
OOM	Out of memory	Migrate to GPU with more VRAM
ECC	Uncorrectable ECC error	Mark GPU unhealthy, migrate away
Device Assert	Device-side assertion	Migrate to healthy GPU
Illegal Access	Illegal memory access	Migrate to healthy GPU
Launch Failure	CUDA launch failure	Migrate to healthy GPU
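
As a rough illustration of how errors might be sorted into the categories in the table, the sketch below matches substrings of the exception message. The function name and the patterns are assumptions based on typical CUDA and PyTorch error messages; Flexium's actual classification logic is not shown in this document and may differ.

```python
from typing import Optional

# Assumed message fragments for each category in the table above.
_PATTERNS = [
    ("OOM", ("out of memory",)),
    ("ECC", ("uncorrectable ecc error",)),
    ("Device Assert", ("device-side assert",)),
    ("Illegal Access", ("illegal memory access",)),
    ("Launch Failure", ("unspecified launch failure",)),
]


def classify_gpu_error(exc: BaseException) -> Optional[str]:
    """Return the table's error type for `exc`, or None if it is not recognized."""
    # CUDA failures surface in PyTorch as RuntimeError (or a subclass of it).
    if not isinstance(exc, RuntimeError):
        return None
    message = str(exc).lower()
    for label, needles in _PATTERNS:
        if any(needle in message for needle in needles):
            return label
    return None


print(classify_gpu_error(RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")))
print(classify_gpu_error(ValueError("Not a CUDA error")))
```

Returning None for anything unrecognized matches the behavior described under Error Propagation: such errors are not candidates for recovery.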

Configuration

Retries (Decorator/Iterator only)

Control how many retry attempts are made:

# Decorator with custom retries
@flexium.auto.recoverable(retries=5)
def train_step():
    ...

# Iterator with custom retries
for attempt in flexium.auto.recoverable(retries=5):
    with attempt:
        ...

Error Propagation

Non-CUDA errors and unrecognized RuntimeErrors are always re-raised immediately:

with flexium.auto.recoverable():
    raise ValueError("Not a CUDA error")  # Re-raised immediately

Requirements

Migration must be enabled (NVIDIA driver 580+ required)
Multiple GPUs available for recovery to work

If migration is disabled or no alternative GPU is available:

Simple context manager: the original error is re-raised
Decorator/Iterator: retries are attempted, then the error is re-raised

Standalone Mode

GPU error recovery works with or without an orchestrator connection:

With orchestrator: The orchestrator coordinates recovery, tracking GPU health across all processes and making smart decisions about which GPU to migrate to.
Without orchestrator (standalone): Flexium finds an alternative GPU locally by scanning available GPUs, checking their free memory, and selecting the best candidate. Previously failed GPUs are tracked and avoided.
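
The standalone selection described above can be sketched as a small pure function. This is an illustration under assumptions, not Flexium's code: pick_gpu and its parameters are hypothetical, and "best candidate" is taken to mean the GPU with the most free memory. In a real process the free-memory map could be filled from torch.cuda.mem_get_info(i) for each i in range(torch.cuda.device_count()).

```python
from typing import Dict, Iterable, Optional


def pick_gpu(
    free_bytes: Dict[int, int],
    failed: Iterable[int],
    current: int,
    required_bytes: int = 0,
) -> Optional[int]:
    """Pick the GPU index with the most free memory.

    Skips the current GPU, previously failed GPUs, and GPUs with less than
    `required_bytes` free. Ties go to the lower index. Returns None if no
    GPU qualifies.
    """
    excluded = set(failed) | {current}
    candidates = {
        idx: free
        for idx, free in free_bytes.items()
        if idx not in excluded and free >= required_bytes
    }
    if not candidates:
        return None
    return max(candidates, key=lambda idx: (candidates[idx], -idx))


GiB = 1024 ** 3
free = {0: 1 * GiB, 1: 20 * GiB, 2: 30 * GiB, 3: 30 * GiB}

print(pick_gpu(free, failed=[2], current=0))
print(pick_gpu(free, failed=[1, 2, 3], current=0))
```

A None result corresponds to the case where no alternative GPU is available and the original error has to be re-raised.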

How It Works

Error Detection: CUDA errors are caught and classified by type
Error State Clearing: CUDA error state is cleared via:
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
Recovery Target: For OOM errors, Flexium parses the error message to estimate memory needed and requests a GPU with sufficient free VRAM
Migration: Training is migrated using zero-residue migration
Continuation/Retry:
    Simple context manager: Exception is suppressed, training continues
    Decorator: Function is called again with same arguments
    Iterator: Next iteration of the for loop runs
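
The Recovery Target step relies on reading a size out of the OOM message. A minimal sketch of that idea follows, assuming the common PyTorch wording "Tried to allocate <number> <unit>"; the function name and the regular expression are illustrative, and the exact parsing Flexium performs is not documented here.

```python
import re
from typing import Optional

# Units as they typically appear in PyTorch OOM messages (an assumption).
_UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}
_ALLOC_RE = re.compile(r"Tried to allocate\s+([0-9]*\.?[0-9]+)\s*(bytes|KiB|MiB|GiB)")


def requested_bytes(message: str) -> Optional[int]:
    """Return the size of the failed allocation in bytes, or None if not found."""
    match = _ALLOC_RE.search(message)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return int(value * _UNITS[unit])


msg = "CUDA out of memory. Tried to allocate 2.00 GiB. GPU 0 has a total capacity of 23.65 GiB"
print(requested_bytes(msg))
```

The parsed value is only the size of the single allocation that failed, so it is a lower bound on what the training step needs; a target GPU should have noticeably more free memory than this.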

Try It Out

Run the interactive demo to see GPU error recovery in action:

# Simple mode - operation is lost, training continues
python examples/simple/oom_recovery_demo.py --mode simple

# Decorator mode - operation is replayed with same data
python examples/simple/oom_recovery_demo.py --mode decorator

# Iterator mode - you control the retry loop
python examples/simple/oom_recovery_demo.py --mode iterator

The demo spawns a subprocess to fill GPU memory, then triggers OOM and shows the recovery process. The decorator and iterator modes verify that the same data produces the same result after migration.

Limitations

Only works with supported CUDA error types
If all GPUs are exhausted or unsuitable, the error is eventually re-raised
Some errors (like ECC) may indicate hardware problems that affect all GPUs
In standalone mode (no orchestrator), there is no coordination between processes, so if multiple processes hit errors simultaneously, they may all try to migrate to the same GPU