Getting Started with Flexium.AI¶
This guide will help you set up Flexium.AI and run your first migration-enabled training job.
Prerequisites¶
- Python 3.8+
- PyTorch 2.0+ with CUDA support
- NVIDIA GPU with CUDA support
- NVIDIA Driver 580+ (required for zero-residue migration)
- Linux x86_64 (Windows/macOS not yet supported)
Installation¶
From PyPI¶
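Assuming the package is published on PyPI under the name `flexium` (the import name used in this guide; the exact distribution name is not confirmed here):

```shell
pip install flexium
```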
From Source¶
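A typical from-source, editable install. The repository URL below is a placeholder; substitute the actual project repository:

```shell
# Clone the repository (URL is a placeholder) and install in editable mode
git clone https://github.com/<org>/flexium.git
cd flexium
pip install -e .
```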
PyTorch Installation¶
Flexium requires PyTorch with CUDA support. Install PyTorch following the official instructions for your system and CUDA version.
Dependencies¶
Core dependencies (installed automatically):
- `python-socketio[client]>=5.0.0` - WebSocket communication
- `pynvml>=11.0.0` - NVIDIA Management Library for GPU monitoring
Development dependencies:
- `pytest>=7.0.0` - Testing framework
Quick Start¶
Step 1: Connect to Flexium Server¶
Flexium is a cloud-hosted service. Set the FLEXIUM_SERVER environment variable with your workspace:
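For example, using the workspace address format shown later in this guide (replace `myworkspace` with your own workspace name):

```shell
# Point flexium at your workspace (replace "myworkspace" with yours)
export FLEXIUM_SERVER="app.flexium.ai/myworkspace"
```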
Sign up for free at app.flexium.ai to create your workspace.
Step 2: Add flexium to Your Training Script¶
Add just two lines to your existing code:

```python
import flexium.auto  # Add this import

# ... your existing imports ...
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Your model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

# Training
with flexium.auto.run():  # Wrap your training
    model = Net().cuda()  # Standard PyTorch!
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(100):
        for batch in dataloader:  # your existing DataLoader
            data, target = batch[0].cuda(), batch[1].cuda()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```
That's it! Your training is now migration-enabled.
Step 3: Run Your Training¶
```shell
# Set server address with workspace
export FLEXIUM_SERVER="app.flexium.ai/myworkspace"

# Run your script normally
python train.py
```
Step 4: Monitor and Migrate¶
Open your workspace dashboard at app.flexium.ai to:
- See all running training jobs
- Monitor GPU utilization
- Trigger migrations with one click
- Pause and resume training jobs
Configuration¶
Environment Variables (Recommended)¶
```shell
# Server with workspace (required)
export FLEXIUM_SERVER="app.flexium.ai/myworkspace"

# Optional: default device
export GPU_DEVICE=cuda:0
```
Config File¶
Create ~/.flexiumrc:
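A sketch of what `~/.flexiumrc` might contain. The keys below simply mirror the environment variables above and are illustrative, not a confirmed schema:

```
# Illustrative only - key names are assumptions mirroring the env vars
server = app.flexium.ai/myworkspace
device = cuda:0
```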
Inline Parameters¶
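If `flexium.auto.run()` accepts keyword arguments (not confirmed in this guide), configuration could plausibly be passed inline. Treat the parameter names below as hypothetical:

```python
import flexium.auto

# Hypothetical inline parameters - check the API reference for actual names
with flexium.auto.run(server="app.flexium.ai/myworkspace", device="cuda:0"):
    ...  # your training loop
```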
Verification¶
Check Driver Version¶
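The installed driver version appears in the top banner of `nvidia-smi`, and can also be queried directly; verify it reports 580 or newer (see Prerequisites):

```shell
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```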
Check Installation¶
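A quick import check; note that `flexium.__version__` is an assumption here, as not every package exposes a version attribute:

```shell
# Verify the package imports cleanly (__version__ is an assumption)
python -c "import flexium; print(flexium.__version__)"
```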
Test Migration¶
1. Set your server connection
2. Run the example
3. Open your workspace dashboard
4. Click "Migrate" to move training to another GPU
5. Watch training continue seamlessly!
6. Verify zero residue with `nvidia-smi` - the source GPU should show 0 MB for the flexium process
Connection Resilience¶
Flexium automatically handles connection issues:
- On connection loss, you'll see: `[flexium] Lost connection, attempting reconnect...`
- On successful reconnection: `[flexium] Reconnected!`
- Training continues uninterrupted during brief outages
Next Steps¶
- Architecture Overview - Understand how it works
- API Reference - Full API documentation
- Examples - More code examples
- Troubleshooting - Common issues
Getting Help¶
- GitHub Issues: Report bugs or request features on the project repository
- Documentation: You're reading it!