Pause/Resume

Flexium provides pause and resume capabilities for long-running training jobs.

Driver Requirement: NVIDIA driver 550+ (pause/resume on the same GPU), 580+ (resume on a different GPU)

Overview

Training jobs can be paused at any point and resumed later. With driver 550+, you can resume on the same GPU. With driver 580+, you can resume on any GPU on the same machine.
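
The driver thresholds above can be checked before starting a job. The following is a minimal sketch, not part of Flexium: the function names are hypothetical, and it assumes that "550+" and "580+" refer to the major component of the driver version string. The `nvidia-smi --query-gpu=driver_version` query is a standard NVIDIA command.

```python
import subprocess


def driver_major(version: str) -> int:
    """Return the major component of a version string such as '550.54.14'."""
    return int(version.strip().split(".")[0])


def pause_resume_support(version: str) -> dict:
    """Map a driver version to the capabilities listed in this section.

    Assumption: the thresholds apply to the major version number only.
    """
    major = driver_major(version)
    return {
        "same_gpu": major >= 550,       # pause and resume on the same GPU
        "different_gpu": major >= 580,  # resume on another GPU of the same machine
    }


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a machine with an NVIDIA driver installed, `pause_resume_support(installed_driver_version())` reports which of the two modes should be available.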

How It Works

1. Pause - You trigger pause via the dashboard
2. GPU State Captured - Flexium captures the complete GPU state at driver level
3. GPU Freed - GPU memory is completely released (0 MB residue)
4. Resume - You trigger resume via the dashboard
5. Training Continues - Training continues from the exact point it was paused
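
The lifecycle above can be pictured as a two-state machine. This is a toy model for illustration only: `PausableJob` and `JobState` are hypothetical names, and the real capture happens at driver level inside Flexium, not in Python objects.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    PAUSED = "paused"


class PausableJob:
    """Toy model of the pause/resume lifecycle; not Flexium's implementation."""

    def __init__(self, step: int = 0, gpu_memory_mb: int = 0):
        self.state = JobState.RUNNING
        self.step = step
        self.gpu_memory_mb = gpu_memory_mb
        self._snapshot = None

    def pause(self) -> None:
        if self.state is not JobState.RUNNING:
            raise RuntimeError("only a running job can be paused")
        # "GPU State Captured": remember everything needed to continue later.
        self._snapshot = {"step": self.step, "gpu_memory_mb": self.gpu_memory_mb}
        # "GPU Freed": no memory is left allocated on the device.
        self.gpu_memory_mb = 0
        self.state = JobState.PAUSED

    def resume(self) -> None:
        if self.state is not JobState.PAUSED:
            raise RuntimeError("only a paused job can be resumed")
        # "Training Continues": restore the captured state exactly.
        self.step = self._snapshot["step"]
        self.gpu_memory_mb = self._snapshot["gpu_memory_mb"]
        self._snapshot = None
        self.state = JobState.RUNNING
```

The point of the model is the invariant: while paused, the job holds no GPU memory, and after resume it holds the same step counter and memory footprint it had before the pause.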

Dashboard Display

When a process is paused:

- Status shows "Paused" with a distinctive badge
- Memory displays the last known memory usage before pause (not 0)
- Runtime continues to track total training time (not reset)

This allows you to see at a glance how much GPU memory will be needed when resuming.

Auto-Resume on Disconnect

If the connection to Flexium is lost while a process is paused, Flexium will:

1. Attempt to reconnect for up to 5 minutes
2. If reconnection fails, automatically resume training on the last-used device
3. Continue running in "local mode"
4. Automatically reconnect when the connection is restored

This ensures your paused training jobs don't hang indefinitely.
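
The reconnect-then-fall-back behavior described above follows a common deadline pattern. The sketch below illustrates that pattern only; `wait_for_reconnect`, its parameters, and the 5-second retry interval are assumptions, not Flexium's actual code. Only the 5-minute window comes from the text.

```python
import time
from typing import Callable

RECONNECT_WINDOW_S = 5 * 60  # "up to 5 minutes", as stated above


def wait_for_reconnect(
    try_connect: Callable[[], bool],
    window_s: float = RECONNECT_WINDOW_S,
    retry_interval_s: float = 5.0,  # assumed value; not documented
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry try_connect until it succeeds or the window expires.

    Returns "reconnected" on success, or "local_mode" when the window
    runs out, at which point the caller would resume training on the
    last-used device.
    """
    deadline = clock() + window_s
    while clock() < deadline:
        if try_connect():
            return "reconnected"
        sleep(retry_interval_s)
    return "local_mode"
```

Passing `clock` and `sleep` as parameters keeps the function testable without waiting five real minutes.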

Use Cases

Resource Sharing - Temporarily yield GPU for higher-priority jobs
Maintenance - Safely pause for system updates

Usage

import flexium
flexium.init()

# Your training code
train_model()
# Pause/resume is triggered via the dashboard

Or with explicit scope control:

import flexium.auto

with flexium.auto.run():
    train_model()

Pause and resume are triggered through the web dashboard; there is no manual API for pausing from within your code.