Flexium Architecture¶
This document explains how Flexium enables live GPU migration for your training jobs.
Overview¶
Flexium enables live GPU migration for PyTorch training jobs. Your training can be moved between GPUs without losing progress and with zero memory residue on the source GPU.
Key Capabilities¶
- Zero VRAM Residue: When a process migrates, ALL memory is freed from the source GPU
- In-Process Migration: Training continues in the same process, same loop iteration
- Transparent Integration: Just call flexium.init() at the start of your script
- Pause/Resume: Free the GPU completely, resume later on any available GPU
How You Use It¶
┌───────────────────────────────────────────────────────────┐
│                   YOUR TRAINING PROCESS                   │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │  flexium.init()                                     │  │
│  │                                                     │  │
│  │  ┌───────────────────────────────────────────────┐  │  │
│  │  │              Your Training Code               │  │  │
│  │  │     model.cuda(), optimizer.step(), etc.      │  │  │
│  │  └───────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   │
│   │  GPU 0  │   │  GPU 1  │   │  GPU 2  │   │  GPU 3  │   │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   │
└───────────────────────────────────────────────────────────┘
                              │
                              │ Communicates with
                              ▼
┌───────────────────────────────────────────────────────────┐
│                FLEXIUM CLOUD (flexium.ai)                 │
│                                                           │
│  Web dashboard for monitoring and triggering migrations   │
└───────────────────────────────────────────────────────────┘
How It Works¶
Zero VRAM Residue¶
Problem: Traditional approaches to GPU migration (moving tensors with .to()) leave memory behind on the source GPU, because PyTorch's caching allocator keeps freed blocks reserved instead of returning them to the driver.
Solution: Flexium captures and restores the complete GPU state at the driver level, guaranteeing zero residue. This requires NVIDIA driver 550+ for pause/resume and 580+ for GPU migration.
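The residue problem described above can be pictured with a toy model. The sketch below is a simplified stand-in for a caching allocator, not PyTorch's actual implementation and not Flexium code; the class name ToyCachingAllocator and its byte counters are assumptions made for illustration.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (illustrative only, not PyTorch's)."""

    def __init__(self) -> None:
        self.allocated = 0   # bytes currently held by live tensors
        self.reserved = 0    # bytes obtained from the driver
        self._cache = []     # sizes of freed blocks kept for reuse

    def malloc(self, size: int) -> int:
        # Reuse a cached block that is large enough, if one exists.
        for i, cached in enumerate(self._cache):
            if cached >= size:
                block = self._cache.pop(i)
                self.allocated += block
                return block
        # Otherwise ask the "driver" for new memory.
        self.reserved += size
        self.allocated += size
        return size

    def free(self, block: int) -> None:
        # The block goes back to the cache, not to the driver.
        self.allocated -= block
        self._cache.append(block)

    def empty_cache(self) -> None:
        # Hand cached blocks back to the driver.
        self.reserved -= sum(self._cache)
        self._cache.clear()


gpu0 = ToyCachingAllocator()
weights = gpu0.malloc(512)
gpu0.free(weights)                         # e.g. the tensor moved to another device
residue = gpu0.reserved - gpu0.allocated   # memory still held on the source GPU
```

In real PyTorch the corresponding counters are torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). Even after torch.cuda.empty_cache(), the process's CUDA context still occupies memory on the device.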
In-Process Migration¶
Unlike traditional approaches, Flexium migrates within the same process:
- No process restart required
- Training continues from the exact same point
- All Python state preserved (variables, loop counters, etc.)
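One way to picture in-process migration is a hook that runs between batches and swaps the device while the loop's local variables stay untouched. This is a minimal sketch under that assumption, not Flexium's implementation: Trainer, request_migration, and _checkpoint are hypothetical names, and the "device" is just an integer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Trainer:
    """Toy stand-in for a training process; no real GPU work happens."""
    device: int = 0
    pending_target: Optional[int] = None            # set by an external request
    log: List[Tuple[int, int]] = field(default_factory=list)

    def request_migration(self, target: int) -> None:
        # Hypothetical: a real request would arrive asynchronously.
        self.pending_target = target

    def _checkpoint(self) -> None:
        # Runs between batches: the only point where the device may change.
        if self.pending_target is not None:
            self.device = self.pending_target
            self.pending_target = None

    def train(self, batches: List[int], migrate_at: int, target: int) -> int:
        running_total = 0                           # ordinary Python state
        for step, batch in enumerate(batches):
            if step == migrate_at:
                self.request_migration(target)      # simulate an external trigger
            self._checkpoint()
            running_total += batch                  # stand-in for forward/backward
            self.log.append((step, self.device))
        return running_total


trainer = Trainer()
total = trainer.train(batches=[1, 2, 3, 4], migrate_at=2, target=1)
```

The loop counter and running_total carry on across the device change, which is the property the list above describes: same process, same iteration, same Python state.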
Minimal Code Changes¶
Simple approach (recommended):
import flexium
from torch.optim import Adam

flexium.init()

# 100% standard PyTorch code
model = Net().cuda()  # Net is your own nn.Module
optimizer = Adam(model.parameters())
for batch in dataloader:
    ...
Explicit scope control (advanced):
import flexium.auto

with flexium.auto.run():
    # Flexium is active only within this block
    model = Net().cuda()
    for batch in dataloader:
        ...
Migration Mechanism¶
When you trigger a migration from the dashboard:
- Pause - Training pauses between batches
- Capture - Complete GPU state is captured at driver level
- Release - Source GPU is completely freed (0 MB)
- Restore - State is restored on target GPU
- Resume - Training continues from the exact same point
Your training code never knows it moved.
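The five steps above can be traced with a small simulation. This is only a sketch of the sequence as described here: the Step enum, the migrate function, and the per-GPU memory table are hypothetical, and the "captured state" is reduced to a single number of megabytes.

```python
from enum import Enum
from typing import Dict, List


class Step(Enum):
    PAUSE = "pause"
    CAPTURE = "capture"
    RELEASE = "release"
    RESTORE = "restore"
    RESUME = "resume"


def migrate(vram_mb: Dict[int, int], source: int, target: int) -> List[Step]:
    """Walk the five steps over a toy table of per-GPU memory usage (MB)."""
    trace = [Step.PAUSE]            # training stops between batches
    snapshot = vram_mb[source]      # stand-in for the captured GPU state
    trace.append(Step.CAPTURE)
    vram_mb[source] = 0             # source GPU is completely freed
    trace.append(Step.RELEASE)
    vram_mb[target] += snapshot     # state reappears on the target GPU
    trace.append(Step.RESTORE)
    trace.append(Step.RESUME)       # training continues
    return trace


usage = {0: 8192, 1: 0}
steps = migrate(usage, source=0, target=1)
```

After the call, the toy table shows 0 MB on the source GPU and the full footprint on the target, mirroring the Release and Restore steps.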
Configuration¶
Environment Variable (Recommended)¶
Inline Parameter¶
import flexium
flexium.init(server="app.flexium.ai/myworkspace")
# Or with explicit scope:
# with flexium.auto.run(orchestrator="app.flexium.ai/myworkspace"):
#     ...
Config File (~/.flexiumrc)¶
Requirements¶
- Python 3.8+
- PyTorch 2.0+ with CUDA 12.4+
- NVIDIA Driver:
  - 550+ for pause/resume (same GPU)
  - 580+ for GPU migration (different GPU)
- Linux x86_64
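To check which features a machine's driver supports, you can compare the driver's major version against the thresholds listed above. The helper below is a hypothetical sketch, not part of Flexium. It assumes you pass in a version string such as the one printed by nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
from typing import Dict

PAUSE_RESUME_MIN_DRIVER = 550   # threshold from the requirements list
MIGRATION_MIN_DRIVER = 580      # threshold from the requirements list


def driver_capabilities(driver_version: str) -> Dict[str, bool]:
    """Map a driver version string like '550.54.14' to supported features."""
    major = int(driver_version.strip().split(".")[0])
    return {
        "pause_resume": major >= PAUSE_RESUME_MIN_DRIVER,
        "migration": major >= MIGRATION_MIN_DRIVER,
    }


caps = driver_capabilities("550.54.14")
```

Only the major version is compared, since the requirements are stated as 550+ and 580+ without minor versions.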