GPU Error Recovery¶
Flexium provides automatic GPU error recovery, allowing training to survive and recover from common CUDA errors like OOM, ECC errors, and device asserts.
Overview¶
GPU errors are a common cause of failed training runs:
- OOM (Out of Memory): Batch size too large or memory fragmentation
- ECC Errors: Hardware memory corruption
- Device Assert: Kernel-level assertion failures
- Illegal Memory Access: Memory access violations
- Launch Failures: CUDA kernel launch issues
With Flexium's recoverable(), your training can automatically detect these errors, migrate to a healthy GPU, and optionally retry the failed operation.
Three Ways to Use recoverable()¶
Flexium provides three patterns, from simplest to most control:
Option 1: Simple Context Manager (Recommended)¶
The current operation is LOST, but training continues on the new GPU.
This is the simplest approach. If a GPU error occurs, Flexium migrates to a new GPU and suppresses the exception. The code inside the with block that failed is not retried - that batch/operation is lost, but training continues with the next iteration.
```python
import torch
import flexium.auto

with flexium.auto.run():
    model = Net().cuda()
    optimizer = torch.optim.Adam(model.parameters())

    for batch in dataloader:
        with flexium.auto.recoverable():
            # If OOM happens here, this batch is LOST
            # but we migrate to new GPU and continue
            inputs, target = batch
            output = model(inputs.cuda())
            loss = criterion(output, target.cuda())
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Training continues here on next batch
```
Output when OOM occurs:
```
[flexium] GPU error: OOM
[flexium] WARNING: The current operation is LOST. Migrating to new GPU...
[flexium] Migrating to cuda:2...
[flexium] Migration complete. Training continues (current batch was lost).
```
Operation is Lost
With this pattern, the failing operation (batch) is not retried. For most deep learning training, losing one batch is acceptable. If you need to retry the exact same operation, use the decorator or iterator pattern below.
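Conceptually, the simple pattern behaves like an exception-suppressing context manager. The sketch below is a plain-Python analogue, not Flexium's actual implementation: the `recoverable_sketch` name and the marker strings are illustrative, and the migration step is replaced by a comment.

```python
from contextlib import contextmanager

# Illustrative message fragments; Flexium's real detection may differ.
CUDA_ERROR_MARKERS = ("out of memory", "ECC error", "device-side assert")

@contextmanager
def recoverable_sketch():
    """Suppress recognized GPU errors so the caller's loop continues."""
    try:
        yield
    except RuntimeError as err:
        if not any(marker in str(err) for marker in CUDA_ERROR_MARKERS):
            raise  # unrecognized errors always propagate
        # Flexium would migrate to a healthy GPU here; the failed
        # operation is simply dropped and control returns to the caller.
```

Note that the `with` block's remaining statements are skipped once the error is raised, which is exactly why the batch is lost rather than retried.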
Option 2: Decorator (Replays the Operation)¶
The operation is RETRIED on the new GPU.
Wrap your training step in a function with the @recoverable decorator. If a GPU error occurs, Flexium migrates to a new GPU and calls the function again with the same arguments.
```python
import torch
import flexium.auto

@flexium.auto.recoverable(retries=3)
def train_step(model, batch, optimizer, criterion):
    inputs, target = batch
    output = model(inputs.cuda())
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

with flexium.auto.run():
    model = Net().cuda()
    optimizer = torch.optim.Adam(model.parameters())

    for batch in dataloader:
        loss = train_step(model, batch, optimizer, criterion)
        # If OOM happened, train_step was retried on new GPU
```
You can also use @recoverable without parentheses; it uses the default of 3 retries.
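Supporting both the bare and the parenthesized form is a standard Python decorator pattern. The following is a generic sketch of that pattern, not Flexium's source: the retry semantics are simplified and GPU migration is omitted.

```python
import functools

def recoverable(func=None, *, retries=3):
    """Usable as @recoverable or @recoverable(retries=N)."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            last_err = None
            for _ in range(retries):
                try:
                    return f(*args, **kwargs)
                except RuntimeError as err:
                    last_err = err
                    # a real implementation would migrate GPUs here
            raise last_err
        return wrapper
    # Bare usage (@recoverable) passes the function in directly;
    # @recoverable(retries=N) calls us with func=None first.
    return decorate if func is None else decorate(func)
```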
Output when OOM occurs:
```
[flexium] GPU error: OOM (attempt 1/3)
[flexium] Recovering: migrating to cuda:2...
[flexium] Recovery successful - now on cuda:2, retrying operation...
```
Option 3: Iterator Pattern (Advanced)¶
Most control over retry logic.
You write the retry loop structure. This is useful when you need custom logic between retries.
```python
import torch
import flexium.auto

with flexium.auto.run():
    model = Net().cuda()
    optimizer = torch.optim.Adam(model.parameters())

    for batch in dataloader:
        for attempt in flexium.auto.recoverable(retries=3):
            with attempt:
                inputs, target = batch
                output = model(inputs.cuda())
                loss = criterion(output, target.cuda())
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
```
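One way such an attempt iterator could be structured in plain Python is a generator that yields context managers and stops once an attempt exits cleanly. This is a hypothetical sketch; the names, the failure bookkeeping, and the OOM-only matching are all illustrative, and migration is again replaced by a comment.

```python
class _Attempt:
    """Context manager for one try; swallows recognized GPU errors."""
    def __init__(self, state):
        self._state = state

    def __enter__(self):
        self._state["failed"] = False
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is RuntimeError and "out of memory" in str(exc):
            self._state["failed"] = True  # migration would happen here
            return True                   # suppress, let the loop retry
        return False

def recoverable_iter(retries=3):
    state = {"failed": False}
    for _ in range(retries):
        yield _Attempt(state)
        if not state["failed"]:
            return  # the attempt succeeded; stop yielding
    raise RuntimeError("out of memory: retries exhausted")
```

Because you own the surrounding `for` loop, custom logic (logging, backoff, reducing batch size) fits naturally between attempts.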
Supported Errors¶
| Error Type | Description | Recovery Action |
|---|---|---|
| OOM | Out of memory | Migrate to GPU with more VRAM |
| ECC | Uncorrectable ECC error | Mark GPU unhealthy, migrate away |
| Device Assert | Device-side assertion | Migrate to healthy GPU |
| Illegal Access | Illegal memory access | Migrate to healthy GPU |
| Launch Failure | CUDA launch failure | Migrate to healthy GPU |
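Detecting these error types typically comes down to matching fragments of the `RuntimeError` message that PyTorch surfaces. The classifier below is a hypothetical sketch: the patterns reflect common CUDA/PyTorch message fragments, but Flexium's actual matching may differ.

```python
# Common CUDA error message fragments (illustrative, not exhaustive).
ERROR_PATTERNS = {
    "OOM": "out of memory",
    "ECC": "uncorrectable ECC error",
    "Device Assert": "device-side assert",
    "Illegal Access": "an illegal memory access",
    "Launch Failure": "unspecified launch failure",
}

def classify_cuda_error(err):
    """Return the error category, or None if unrecognized (re-raise)."""
    msg = str(err)
    for kind, pattern in ERROR_PATTERNS.items():
        if pattern in msg:
            return kind
    return None
```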
Configuration¶
Retries (Decorator/Iterator only)¶
Control how many retry attempts are made:
```python
# Decorator with custom retries
@flexium.auto.recoverable(retries=5)
def train_step():
    ...

# Iterator with custom retries
for attempt in flexium.auto.recoverable(retries=5):
    with attempt:
        ...
```
Error Propagation¶
Non-CUDA errors (for example, a ValueError raised inside the block) and unrecognized RuntimeErrors are always re-raised immediately; recovery only triggers for the supported CUDA error types listed above.
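A predicate enforcing this rule would need to check both the exception type and its message. The sketch below is hypothetical; `is_recoverable` is not a documented Flexium function.

```python
# Illustrative message fragments for the supported error types.
RECOGNIZED = ("out of memory", "ECC error", "device-side assert",
              "illegal memory access", "launch failure")

def is_recoverable(err):
    """Only RuntimeErrors carrying a recognized CUDA message qualify."""
    return isinstance(err, RuntimeError) and any(s in str(err) for s in RECOGNIZED)
```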
Requirements¶
- Migration must be enabled (NVIDIA driver 580+ required)
- Multiple GPUs available for recovery to work
If migration is disabled or no alternative GPU is available:

- Simple context manager: the original error is re-raised
- Decorator/Iterator: retries are attempted, then the error is re-raised
Standalone Mode¶
GPU error recovery works with or without an orchestrator connection:
- With orchestrator: The orchestrator coordinates recovery, tracking GPU health across all processes and making smart decisions about which GPU to migrate to.
- Without orchestrator (standalone): Flexium finds an alternative GPU locally by scanning available GPUs, checking their free memory, and selecting the best candidate. Previously failed GPUs are tracked and avoided.
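The standalone selection step can be pictured as choosing the GPU with the most free memory that satisfies the requirement and is not on the failed list. This is a simplified sketch; `pick_gpu` is illustrative and the real candidate scoring may differ.

```python
def pick_gpu(free_mib, needed_mib, failed):
    """free_mib: {gpu_index: free MiB}. Return a GPU index, or None."""
    candidates = [(free, idx) for idx, free in free_mib.items()
                  if idx not in failed and free >= needed_mib]
    if not candidates:
        return None
    # Prefer the GPU with the most free memory.
    return max(candidates)[1]
```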
How It Works¶
1. Error Detection: CUDA errors are caught and classified by type
2. Error State Clearing: CUDA error state is cleared via `torch.cuda.synchronize()`, `torch.cuda.empty_cache()`, and `torch.cuda.reset_peak_memory_stats()`
3. Recovery Target: For OOM errors, Flexium parses the error message to estimate the memory needed and requests a GPU with sufficient free VRAM
4. Migration: Training is migrated using zero-residue migration
5. Continuation/Retry:
    - Simple context manager: Exception is suppressed, training continues
    - Decorator: Function is called again with the same arguments
    - Iterator: Next iteration of the for loop runs
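The OOM memory estimate described above can be derived from the standard PyTorch OOM message, which includes a "Tried to allocate …" clause. A sketch of such a parser follows; Flexium's actual parsing may differ.

```python
import re

def parse_oom_request_mib(msg):
    """Extract the requested allocation size in MiB, or None."""
    m = re.search(r"Tried to allocate ([\d.]+) (GiB|MiB)", msg)
    if m is None:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value * 1024.0 if unit == "GiB" else value
```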
Try It Out¶
Run the interactive demo to see GPU error recovery in action:
```bash
# Simple mode - operation is lost, training continues
python examples/simple/oom_recovery_demo.py --mode simple

# Decorator mode - operation is replayed with same data
python examples/simple/oom_recovery_demo.py --mode decorator

# Iterator mode - you control the retry loop
python examples/simple/oom_recovery_demo.py --mode iterator
```
The demo spawns a subprocess to fill GPU memory, then triggers OOM and shows the recovery process. The decorator and iterator modes verify that the same data produces the same result after migration.
Limitations¶
- Only works with supported CUDA error types
- If all GPUs are exhausted or unsuitable, the error is eventually re-raised
- Some errors (like ECC) may indicate hardware problems that affect all GPUs
- In standalone mode (no orchestrator), there's no coordination between processes - if multiple processes hit errors simultaneously, they may all try to migrate to the same GPU