Troubleshooting¶
Common issues and their solutions when using flexium.
Connection Issues¶
Cannot connect to Flexium¶
Symptoms:

- "Failed to connect" error
- Process shows as "stale" immediately
- Heartbeat errors in logs

Solutions:

1. Check that `FLEXIUM_SERVER` is set correctly in your environment.
2. Check network connectivity to the server.
3. Check that your firewall allows outbound connections.
4. Verify the workspace exists: log in to app.flexium.ai and confirm your workspace name is correct.
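The first three steps can be scripted as a quick sanity check. The env var name `FLEXIUM_SERVER` comes from this guide; the default port of 443 is an assumption, so adjust it to match your deployment:

```python
import os
import socket
from urllib.parse import urlparse

def parse_server(server: str):
    """Split a server URL (or bare host[:port]) into (host, port), defaulting to 443."""
    parsed = urlparse(server if "//" in server else "//" + server)
    return parsed.hostname, parsed.port or 443

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """A successful TCP connect confirms DNS, routing, and outbound firewall rules."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

server = os.environ.get("FLEXIUM_SERVER")
print("FLEXIUM_SERVER =", server or "<not set>")
if server:
    host, port = parse_server(server)
    print(f"{host}:{port} reachable: {can_reach(host, port)}")
```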
Training process issues¶
Solutions:

1. Enable debug logging (see the Debugging section below).
2. Check CUDA availability.
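A minimal CUDA availability check using standard PyTorch introspection, guarded so it also runs on machines where PyTorch is missing:

```python
def cuda_status() -> str:
    """Report whether CUDA is usable, using standard PyTorch introspection."""
    try:
        import torch
    except ImportError:
        return "PyTorch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return "CUDA available: " + ", ".join(names)

print(cuda_status())
```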
Process runs in local mode¶
Symptoms:

- "Running in local mode" message
- Process not visible in dashboard

Solutions:

1. Check that `FLEXIUM_SERVER` is set.
2. Training will still work: if the connection fails, training continues in local mode (graceful degradation) and reconnects when the connection is restored.
Training Issues¶
Training doesn't start¶
Symptoms:

- No output from training
- Process hangs after "Process: gpu-XXXXXX"

Solutions:

1. Check for import errors.
2. Check that CUDA is available.
3. Check the process output: it should appear in your terminal; if not, check stderr.
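Silent import failures can be surfaced before the training loop ever starts. The module name `flexium` follows this guide; swap in whatever your script imports:

```python
import importlib
import traceback

def check_import(module: str) -> bool:
    """Return True if the module imports cleanly, printing the traceback if not."""
    try:
        importlib.import_module(module)
        return True
    except Exception:
        traceback.print_exc()
        return False

for mod in ("torch", "flexium"):
    print(mod, "OK" if check_import(mod) else "FAILED")
```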
Model not on correct device¶
Symptoms:

- CUDA errors about tensor device mismatch
- Model on cuda:0 but data on cuda:1

Solutions:

- Use `.cuda()` consistently, so that the model and all input tensors end up on the same device.
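A sketch of the consistent-device pattern: pick the device once and derive everything from it, rather than mixing bare `.cuda()` calls with explicit device indices. The model and tensor shapes here are illustrative:

```python
import torch

# Choose one device up front; everything else derives from it.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16).to(device)   # same device as the model

output = model(batch)                   # no cross-device mismatch possible
print(output.device, output.shape)
```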
Migration Issues¶
Migration not happening¶
Symptoms:

- Click "Migrate" in the dashboard but nothing happens
- Process stays on the same GPU

Solutions:

1. Check the process status in the dashboard: it should show "running", not "stale" or "migrating".
2. Verify heartbeats are working: Flexium sends migration commands via the heartbeat response, so if heartbeats aren't working, migration won't happen.
3. Check for a pending migration: if a migration is already pending, new requests are ignored. Wait for the current migration to complete.
4. Same-device check: migration to the same device is a no-op.
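The heartbeat-driven behavior in steps 2 through 4 can be illustrated with a toy decision function. This is not flexium's actual code, just the logic those steps describe:

```python
def handle_heartbeat_reply(reply: dict, current_device: str, pending):
    """Decide what to do with a migration command carried in a heartbeat reply.

    Returns (new_pending, action). No heartbeat reply means no migration ever
    starts; a pending migration blocks new requests; same-device moves are no-ops.
    """
    target = reply.get("migrate_to")
    if target is None or pending is not None or target == current_device:
        return pending, None
    return target, ("migrate", target)

print(handle_heartbeat_reply({"migrate_to": "cuda:1"}, "cuda:0", None))
```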
Migration fails mid-way¶
Symptoms:

- "Migration failed" error
- Process stays on the original GPU

Solutions:

1. Check free space in /dev/shm.
2. Check for CUDA errors.
3. Check the driver version.
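The first and third checks can be scripted. Treating `/dev/shm` as the staging area is an assumption about the migration path; the `nvidia-smi` query is the standard NVIDIA tool:

```python
import shutil
import subprocess

def free_gb(path: str):
    """Free space in GB at `path`, or None if the path doesn't exist."""
    try:
        return shutil.disk_usage(path).free / 1e9
    except OSError:
        return None

print("/dev/shm free (GB):", free_gb("/dev/shm"))

# Driver version via nvidia-smi.
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver version:", out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not available")
```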
Memory Issues¶
Memory not freed after migration¶
Symptoms:

- Old GPU still shows memory usage after migration
- nvidia-smi shows the process on the old GPU

Solutions:

This shouldn't happen with Flexium.AI's architecture. If it does:

1. Check that the process is actually dead.
2. Check nvidia-smi.
3. Force cleanup.
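A sketch of the first and third steps. Sending signal 0 is the standard POSIX liveness probe; the kill line is left commented out because it is destructive and `stale_pid` is a placeholder:

```python
import os

def pid_alive(pid: int) -> bool:
    """Signal 0 checks existence/permissions without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # exists, but owned by another user
    return True

print("this process alive:", pid_alive(os.getpid()))

# List processes still holding GPU memory (standard nvidia-smi query):
#   nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Force cleanup only once the stale process is confirmed dead to flexium:
#   os.kill(stale_pid, signal.SIGKILL)
```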
Debugging¶
Enable Debug Logging¶
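The library's exact switch isn't shown in this guide; a common pattern is a logger namespace plus an environment variable. Both names below (`flexium` logger, `FLEXIUM_DEBUG`) are assumptions, so check them against your flexium version:

```python
import logging
import os

# Assumed names: the "flexium" logger namespace and the FLEXIUM_DEBUG env var.
os.environ["FLEXIUM_DEBUG"] = "1"
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("flexium").setLevel(logging.DEBUG)
```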
Debug Mode¶
Automatic Debugger Detection¶
If you attach a debugger (PyCharm, VS Code, pdb), flexium automatically switches to debug mode.
Check Process Status¶
View all your processes in the dashboard at app.flexium.ai. Click on any process to see detailed status including device, memory usage, and runtime.
FAQ¶
Q: What happens if connection to Flexium is lost?¶
A: Training continues normally in "local mode". Key behaviors:

- Running processes continue training without interruption. They automatically reconnect when the connection is restored, preserving runtime tracking.
- Paused processes: if paused for more than 5 minutes without a connection, they auto-resume on the last-used device to prevent indefinite hangs.
- Migration won't work until the connection is restored, but training continues unaffected.
Q: Can I run multiple training jobs on the same GPU?¶
A: Yes, but they'll compete for memory. flexium doesn't prevent this - it just tracks and migrates processes.
Q: Does flexium work with DataParallel/DistributedDataParallel?¶
A: Not currently. flexium is designed for single-GPU training per process. Multi-GPU training within a single process is a future enhancement.
Q: What's the overhead of flexium?¶
A: Minimal, typically < 2%. The main overhead is:

- Device string comparison on `.cuda()`/`.to()` calls
- Iterator wrapping (negligible)
- Background heartbeat thread (minimal CPU)
Q: Can I use flexium with mixed precision training?¶
A: Yes! flexium is transparent to mixed precision:
```python
with flexium.auto.run():
    model = Model().cuda()
    scaler = torch.cuda.amp.GradScaler()

    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Q: How do I exclude flexium in production?¶
A: Use the disabled flag, or set the corresponding environment variable before the process starts.
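A sketch of both approaches. The `disabled` keyword and the `FLEXIUM_DISABLED` variable are assumed names, shown only to illustrate the pattern:

```python
import os

# Option 1: pass an assumed `disabled` flag to the run context.
# with flexium.auto.run(disabled=True):
#     train()

# Option 2: assumed environment variable, set before the process starts.
os.environ.setdefault("FLEXIUM_DISABLED", "1")
print("flexium disabled:", os.environ["FLEXIUM_DISABLED"] == "1")
```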
Q: Where does the Flexium server run?¶
A: Flexium is a cloud-hosted service at app.flexium.ai. You don't need to run any servers; just set your workspace in your environment.
Getting Help¶
If you're still stuck:
- Check the project's GitHub Issues
- Enable debug logging and include the output
- Include your environment details (Python, PyTorch, CUDA, and driver versions)
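A small helper to gather those details for a bug report, guarded so it runs even where PyTorch isn't installed:

```python
import platform
import sys

def environment_report() -> str:
    """Collect Python/OS/PyTorch details worth attaching to an issue."""
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"OS: {platform.platform()}",
    ]
    try:
        import torch
        lines.append(f"PyTorch: {torch.__version__}")
        lines.append(f"CUDA available: {torch.cuda.is_available()}")
    except ImportError:
        lines.append("PyTorch: not installed")
    return "\n".join(lines)

print(environment_report())
```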