Troubleshooting¶

Common issues and their solutions when using flexium.

Table of Contents¶

Connection Issues
Training Issues
Migration Issues
Memory Issues
Debugging
FAQ

Connection Issues¶

Cannot connect to Flexium¶

Symptoms:
- "Failed to connect" error
- Process shows as "stale" immediately
- Heartbeat errors in logs

Solutions:

Check FLEXIUM_SERVER is set correctly:

echo $FLEXIUM_SERVER
# Should show: app.flexium.ai/yourworkspace

Check network connectivity:

# Test connection to Flexium cloud
nc -zv app.flexium.ai 443

Check firewall allows outbound connections:

# Ensure outbound HTTPS is allowed
curl -v https://app.flexium.ai/health

Verify workspace exists: Log in to app.flexium.ai and confirm your workspace name is correct.
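
The checks above can be combined into one small script. This is a minimal sketch, not part of flexium: the helper names `parse_server` and `can_connect` are hypothetical, and it assumes `FLEXIUM_SERVER` has the `host/workspace` form shown above, with no URL scheme.

```python
import os
import socket


def parse_server(value):
    """Split 'host/workspace' into (host, workspace); workspace may be empty."""
    host, _, workspace = value.strip().partition("/")
    return host, workspace


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    server = os.environ.get("FLEXIUM_SERVER", "")
    if not server:
        print("FLEXIUM_SERVER is not set")
    else:
        host, workspace = parse_server(server)
        print(f"host={host!r} workspace={workspace!r}")
        print("reachable on port 443:", can_connect(host))
```

A successful TCP connection only shows that outbound traffic on port 443 is allowed. It does not confirm that the workspace name is valid.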

Training process issues¶

Solutions:

Enable debug logging:

# The variable applies to this command only
FLEXIUM_LOG_LEVEL=DEBUG python train.py

Check CUDA availability:

python -c "import torch; print(torch.cuda.is_available())"

Process runs in local mode¶

Symptoms:
- "Running in local mode" message
- Process not visible in dashboard

Solutions:

Check FLEXIUM_SERVER is set:

export FLEXIUM_SERVER="app.flexium.ai/yourworkspace"

Training still works: If the connection fails, training continues in local mode (graceful degradation), and the process reconnects once the connection is restored.

Training Issues¶

Training doesn't start¶

Symptoms:
- No output from training
- Process hangs after "Process: gpu-XXXXXX"

Solutions:

Check for import errors:

# Run without flexium first (assumes your train.py accepts a --disabled flag)
python train.py --disabled

Check CUDA is available:

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())

Check process output: The process output should appear in your terminal. If not, check stderr.

Model not on correct device¶

Symptoms:
- CUDA errors about tensor device mismatch
- Model on cuda:0 but data on cuda:1

Solutions:

Use .cuda() consistently:

import flexium
flexium.init()

model.cuda()       # Correct
model.to("cuda")   # Correct
model.to("cuda:0") # Correct - will be redirected

Migration Issues¶

Migration not happening¶

Symptoms:
- Click "Migrate" in dashboard but nothing happens
- Process stays on same GPU

Solutions:

Check process status in dashboard: The process should show "running", not "stale" or "migrating".
Verify heartbeats are working: Flexium sends migration commands in the heartbeat response, so if heartbeats aren't working, migration won't happen.
Check for pending migration: If a migration is already pending, new requests are ignored. Wait for the current migration to complete.
Same device check: Migrating to the device the process is already on is a no-op.
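
The conditions above can be read as a simple decision order. The following is an illustrative sketch only: `migration_blocker` and its parameters are hypothetical names, and this is not Flexium's actual implementation.

```python
def migration_blocker(status, pending_migration, current_device, target_device):
    """Return the reason a migration request would have no effect, or None."""
    if status != "running":
        return f"process status is {status!r}, expected 'running'"
    if pending_migration:
        return "a migration is already pending; new requests are ignored"
    if current_device == target_device:
        return "target is the current device (no-op)"
    return None


if __name__ == "__main__":
    # Hypothetical values read off the dashboard.
    reason = migration_blocker("running", False, "cuda:0", "cuda:0")
    print(reason or "no documented blocker; check heartbeats next")
```

If none of these conditions applies, heartbeat delivery is the remaining thing to check, since that is the channel migration commands travel over.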

Migration fails mid-way¶

Symptoms:
- "Migration failed" error
- Process stays on original GPU

Solutions:

Check /dev/shm space:

df -h /dev/shm
# Migration uses shared memory

Check for CUDA errors:

nvidia-smi
# Ensure target GPU is available and has enough memory

Check driver version:

nvidia-smi --query-gpu=driver_version --format=csv,noheader
# 550+ for pause/resume, 580+ for GPU migration
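
The shared-memory and driver checks above can also be scripted. This is a rough sketch under two assumptions: that "550+" and "580+" refer to the driver's major version number, and that free space in `/dev/shm` is the relevant figure. The helper names are hypothetical.

```python
import shutil
import subprocess


def driver_major(version):
    """Extract the major version from a string such as '550.54.15'."""
    return int(version.strip().split(".")[0])


def supported_features(version):
    """Map a driver version string to the thresholds listed above."""
    major = driver_major(version)
    return {
        "pause_resume": major >= 550,
        "gpu_migration": major >= 580,
    }


def free_gib(path="/dev/shm"):
    """Free space at path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # nvidia-smi prints one line per GPU; the driver version is the same on each.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    version = out.splitlines()[0]
    print(version, supported_features(version))
    print(f"/dev/shm free: {free_gib():.1f} GiB")
```

The script raises an error if `nvidia-smi` is not installed or exits with a non-zero status, which is itself a useful signal when diagnosing a failed migration.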

Memory Issues¶

Memory not freed after migration¶

Symptoms:
- Old GPU still shows memory usage after migration
- nvidia-smi shows process on old GPU

Solutions:

This shouldn't happen with Flexium.AI's architecture. If it does:

Check process is actually dead:

pgrep -af train.py
# Should only show one process (the new one)

Check nvidia-smi:

nvidia-smi
# Old GPU should show no process from your training

Force cleanup:

# Kill any orphaned processes (this matches every train.py process, including the active one)
pkill -f "train.py"
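
To find a specific orphaned process rather than killing everything that matches `train.py`, it can help to list the PIDs that currently hold GPU memory. The sketch below parses the output of `nvidia-smi --query-compute-apps`. The function name `parse_compute_apps` is hypothetical, and the parsing assumes the `csv,noheader,nounits` format shown in the code.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    rows = []
    for line in csv_text.splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed lines
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        rows.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            # Memory may be reported as a non-numeric placeholder.
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


if __name__ == "__main__":
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for row in parse_compute_apps(out):
        print(row)
```

Compare the listed PIDs with the process shown in the dashboard, then stop only the stale one with `kill <pid>`.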

Debugging¶

Enable Debug Logging¶

FLEXIUM_LOG_LEVEL=DEBUG python train.py

Debug Mode¶

For debugging, use the disabled parameter:

import flexium

# Option 1: Use disabled parameter
flexium.init(disabled=True)  # Flexium bypassed entirely

# Option 2: Conditional init
debugging = False  # set to True to skip flexium entirely
if not debugging:
    flexium.init()

# Your training code runs as normal PyTorch

Automatic Debugger Detection¶

If you attach a debugger (PyCharm, VS Code, pdb), flexium automatically switches to debug mode.
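
For reference, debugger detection in Python is commonly done with a heuristic like the one below. This is a generic sketch, not a description of flexium's own mechanism, and `debugger_attached` is a hypothetical name.

```python
import sys


def debugger_attached():
    """Heuristic: True if a trace function or a known debugger module is present."""
    if sys.gettrace() is not None:
        # pdb and many IDE debuggers install a trace function.
        return True
    # pydevd backs the PyCharm debugger; debugpy backs the VS Code debugger.
    return any(name in sys.modules for name in ("pydevd", "debugpy"))


if __name__ == "__main__":
    print("debugger attached:", debugger_attached())
```

Coverage tools and profilers can also install a trace function, so a check like this may report a debugger when none is attached. Passing `disabled=True` explicitly avoids relying on detection.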

Check Process Status¶

View all your processes in the dashboard at app.flexium.ai. Click on any process to see detailed status including device, memory usage, and runtime.

FAQ¶

Q: What happens if connection to Flexium is lost?¶

A: Training continues normally in "local mode". Key behaviors:
- Running processes: Continue training without interruption. They automatically reconnect when connection is restored, preserving runtime tracking.
- Paused processes: If paused for more than 5 minutes without connection, they auto-resume on the last-used device to prevent indefinite hangs.
- Migration won't work until connection is restored, but training continues unaffected.
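
The auto-resume rule for paused processes can be pictured as a timeout check. The sketch below is one interpretation of the behavior described above, namely that the process is currently disconnected and has been paused for more than 5 minutes. The names are hypothetical and this is not flexium's code.

```python
import time

AUTO_RESUME_AFTER_S = 5 * 60  # the 5-minute limit described above


def should_auto_resume(paused_at, connected, now=None):
    """True if a paused, disconnected process has exceeded the limit."""
    now = time.monotonic() if now is None else now
    return (not connected) and (now - paused_at) > AUTO_RESUME_AFTER_S


if __name__ == "__main__":
    paused_at = time.monotonic() - 400  # paused 400 seconds ago
    print(should_auto_resume(paused_at, connected=False))
```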

Q: Can I run multiple training jobs on the same GPU?¶

A: Yes, but they'll compete for memory. flexium doesn't prevent this - it just tracks and migrates processes.

Q: Does flexium work with DataParallel/DistributedDataParallel?¶

A: Not currently. flexium is designed for single-GPU training per process. Multi-GPU training within a single process is a future enhancement.

Q: What's the overhead of flexium?¶

A: Minimal, typically < 2%. The main overhead is:
- Device string comparison on .cuda()/.to() calls
- Iterator wrapping (negligible)
- Background heartbeat thread (minimal CPU)
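
To measure the overhead for your own workload, time the same training step with flexium enabled and with `flexium.init(disabled=True)`, then compare the two. The helpers below are a minimal, framework-agnostic sketch. `seconds_per_step` and `overhead_percent` are hypothetical names, and `step_fn` stands for one iteration of your training loop.

```python
import time


def seconds_per_step(step_fn, steps=50, warmup=5):
    """Average wall-clock seconds per call of step_fn, after a warm-up."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps


def overhead_percent(baseline_s, measured_s):
    """Relative slowdown of measured_s over baseline_s, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Placeholder workload; substitute one real training step.
    baseline = seconds_per_step(lambda: sum(range(10_000)))
    print(f"{baseline * 1e6:.1f} microseconds per step")
```

GPU work is asynchronous, so for meaningful numbers `step_fn` should end with `torch.cuda.synchronize()`.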

Q: Can I use flexium with mixed precision training?¶

A: Yes! flexium is transparent to mixed precision:

import torch
import flexium
flexium.init()

# Model, data, target, criterion, and optimizer come from your training script
model = Model().cuda()
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    output = model(data)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Q: How do I disable flexium for certain environments?¶

A: Use the disabled parameter or conditionally call flexium.init():

import os
import flexium

# Option 1: Use disabled parameter (recommended)
flexium.init(disabled=os.environ.get("DISABLE_FLEXIUM") == "1")

# Option 2: Conditional init
if os.environ.get("PRODUCTION") == "1":
    flexium.init()

# Your training code - works with or without flexium
model = Net().cuda()

Q: Where does the Flexium server run?¶

A: Flexium is a cloud-hosted service at app.flexium.ai. You don't need to run any servers - just set your workspace:

export FLEXIUM_SERVER="app.flexium.ai/yourworkspace"

Getting Help¶

If you're still stuck:

Check the GitHub Issues on your repository
Enable debug logging and include the output

Include your environment details:

python --version
pip show torch
nvidia-smi
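
The same details can be gathered in one step for pasting into an issue. This is a convenience sketch using only the standard library. The function names are hypothetical, and the set of fields is just a suggestion.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def package_version(name):
    """Installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def nvidia_smi_summary():
    """GPU name, driver version, and total memory, one line per GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return "nvidia-smi not found"
    result = subprocess.run(
        [exe, "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() or result.stderr.strip()


def collect_env():
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": package_version("torch"),
        "gpus": nvidia_smi_summary(),
    }


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```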
