Python SDK

Submit and manage distributed training jobs from any notebook or script.

privatemind-python is the official Python SDK for submitting and managing training jobs. It's a thin client over the training API: notebooks, scripts, and CI all use the same interface.

Text
pip install privatemind-python

Authentication

The SDK resolves credentials in order:

  1. token= passed to Client(...)
  2. the PRIVATEMIND_TOKEN environment variable
  3. ~/.privatemind/auth (must be mode 0600)
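
The lookup order above can be sketched in a few lines of standard-library Python. This is an illustration, not the SDK's implementation: resolve_token and its env and auth_path parameters are hypothetical names, and the sketch assumes the auth file holds a bare token and that a wrong file mode is treated as an error rather than skipped.

```python
import os
import stat
from pathlib import Path
from typing import Mapping, Optional


def resolve_token(
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    auth_path: Optional[Path] = None,
) -> str:
    """Return the first credential found, following the documented order."""
    # 1. An explicit token= argument wins.
    if token:
        return token

    # 2. Then the PRIVATEMIND_TOKEN environment variable.
    env = os.environ if env is None else env
    from_env = env.get("PRIVATEMIND_TOKEN")
    if from_env:
        return from_env

    # 3. Finally the auth file, which must be mode 0600.
    #    Assumption: the file contains only the token.
    path = Path.home() / ".privatemind" / "auth" if auth_path is None else auth_path
    if path.is_file():
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode != 0o600:
            # Assumption: a wrong mode is an error, not a silent fallthrough.
            raise PermissionError(f"{path} has mode {mode:o}, expected 600")
        return path.read_text().strip()

    raise LookupError("no PrivateMind token found")
```

Passing env and auth_path explicitly keeps the function easy to exercise in tests without touching the real environment or home directory.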

Use a PrivateMind API key as the token, and set the base URL with base_url= or PRIVATEMIND_URL (https://api.privatemind.com).

Submit a job

submit_training is the core entry point. It is keyword-only and returns a Run handle.

Python
from privatemind import submit_training

run = submit_training(
    entrypoint="python train.py --epochs 10",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
)

print(run)            # Run(name='tj-abc12', phase='Pending')
print(run.url)        # link to the run in the PrivateMind UI
run.wait()            # block until Succeeded or Failed
print(run.phase, run.mlflow_run_id)

Parameter                          Default    Purpose
entrypoint                         required   The command each worker runs, e.g. python train.py.
image                              required*  Container image the workers run.
workers                            required   Number of workers.
target_cluster                     required*  Your org's GPU cluster name (from your dashboard or account team).
gpus_per_worker                    0          GPUs requested per worker.
gpu_placements                     None       Optional explicit GPU pinning; omit it and the platform picks owned, free GPUs for you (see below).
cpu_per_worker, memory_per_worker  None       Per-worker CPU and memory requests.
env                                None       Environment variables for the job.
mlflow                             True       Track the run in MLflow so run.mlflow_run_id populates.
volumes                            None       Volumes to mount into every worker (see below).
ttl_seconds_after_finished         None       Auto-clean the job this long after it finishes.
name                               generated  Job name; defaults to a generated short name.

* Inside a GPU workspace, image and target_cluster default from the environment and can be omitted.

GPUs

Set gpus_per_worker and the platform pins real GPUs for you. You do not need to know which physical GPUs your org owns: when you leave gpu_placements unset, the platform auto-derives a placement from the GPUs your org owns that are free right now. To pin exact devices, pass one placement per GPU worker, each with exactly gpus_per_worker indices:

Python
gpu_placements=[{"host": "gpu-node-1", "indices": [0]}]

For multi-node training, set workers greater than 1. Each worker gets gpus_per_worker GPUs, and explicit placements take one entry per worker. Leave gpu_placements unset and the platform packs workers onto as few hosts as possible: a job that fits on one machine stays there, and only larger jobs span multiple hosts. A job that asks for GPUs your org does not own, or for GPUs that conflict with another running job, is rejected at submit time.
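
If you do pin devices, the shape rules above (one placement per worker, each with exactly gpus_per_worker indices) are easy to check locally before submitting. The helper below is a hypothetical sketch, not part of the SDK. The duplicate-device check is an added assumption, and ownership or conflicts with other jobs can only be decided by the platform.

```python
from typing import Any, Mapping, Sequence


def check_placements(
    placements: Sequence[Mapping[str, Any]],
    workers: int,
    gpus_per_worker: int,
) -> None:
    """Raise ValueError if placements do not match the documented shape."""
    # One placement per worker.
    if len(placements) != workers:
        raise ValueError(f"expected {workers} placements, got {len(placements)}")

    seen = set()  # (host, index) pairs already claimed by an earlier worker
    for n, placement in enumerate(placements):
        host = placement["host"]
        indices = list(placement["indices"])

        # Each placement lists exactly gpus_per_worker device indices.
        if len(indices) != gpus_per_worker:
            raise ValueError(
                f"placement {n} has {len(indices)} indices, expected {gpus_per_worker}"
            )

        # Assumption: the same device must not be given to two workers.
        for index in indices:
            if (host, index) in seen:
                raise ValueError(f"GPU {index} on {host} is listed twice")
            seen.add((host, index))


# The single-worker example from above passes silently.
check_placements([{"host": "gpu-node-1", "indices": [0]}], workers=1, gpus_per_worker=1)
```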

The Run handle

submit_training returns a Run, a live handle to the job:

  • phase is the lifecycle state (Pending, Running, Succeeded, Failed).
  • url links to the run in the UI; mlflow_run_id links it to MLflow.
  • start_time and end_time are the job's wall-clock bounds.
  • refresh() re-reads state, wait(timeout=None) blocks until the job is terminal, and cancel() stops a running job.
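
Conceptually, waiting is a refresh-and-check loop. The sketch below shows that pattern in isolation so it runs without the SDK. wait_for_terminal, get_phase, poll_interval and the injected sleep are hypothetical names, and the real wait() may poll differently or treat other phases as terminal.

```python
import time
from typing import Callable, Optional

# Terminal phases as described above.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_terminal(
    get_phase: Callable[[], str],
    timeout: Optional[float] = None,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_phase until it returns a terminal phase, then return it."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        # timeout=None means wait indefinitely.
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"still {phase} after {timeout} seconds")
        sleep(poll_interval)


# Simulated job that needs two polls before it finishes.
phases = iter(["Pending", "Running", "Succeeded"])
final = wait_for_terminal(lambda: next(phases), sleep=lambda seconds: None)
```

With a real handle, get_phase would call run.refresh() and then return run.phase.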

List or look up jobs without holding the original handle:

Python
from privatemind import list_jobs, get_job

for r in list_jobs():
    print(r.name, r.phase)

get_job("tj-abc12").cancel()

Promote a notebook function

@pm.train takes a function you validated interactively and runs it as a distributed, MLflow-tracked job, without leaving Python.

Python
import privatemind as pm

@pm.train(workers=1, gpus_per_worker=1)
def train(lr=3e-4, epochs=10):
    import mlflow, torch       # imports re-run on the worker, so both must be in the image
    assert torch.cuda.is_available()
    ...

train(lr=1e-3)                 # runs locally in the notebook (validate)
run = train.promote(lr=1e-3)   # packages + submits to the GPU -> Run
run.wait()
print(run.mlflow_run_id)

Calling the function runs it in the notebook; .promote() ships it to the fleet. .with_options(...) returns a copy with tweaked settings without redecorating.
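
To make the three behaviours concrete, here is a minimal sketch of a decorator with the same shape: a plain call runs locally, promote() hands the function to a submitter, and with_options() returns a modified copy. It is not how @pm.train is implemented. Trainable, trainable and fake_submit are hypothetical names, and the submitter is a stand-in that only records what it was given.

```python
from typing import Any, Callable, Dict


class Trainable:
    """Wraps a function so it can run locally or be handed to a submitter."""

    def __init__(self, fn: Callable[..., Any], submit: Callable[..., Any],
                 options: Dict[str, Any]) -> None:
        self._fn = fn
        self._submit = submit
        self.options = dict(options)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # A plain call runs the function in the current process.
        return self._fn(*args, **kwargs)

    def promote(self, **kwargs: Any) -> Any:
        # Hand the function, its arguments and the job options to the submitter.
        return self._submit(self._fn, kwargs, self.options)

    def with_options(self, **overrides: Any) -> "Trainable":
        # Return a copy with changed options; the original is left untouched.
        return Trainable(self._fn, self._submit, {**self.options, **overrides})


def trainable(submit: Callable[..., Any], **options: Any) -> Callable[..., Trainable]:
    def decorate(fn: Callable[..., Any]) -> Trainable:
        return Trainable(fn, submit, options)
    return decorate


def fake_submit(fn: Callable[..., Any], kwargs: Dict[str, Any],
                options: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for packaging and submission: report what would be sent.
    return {"function": fn.__name__, "kwargs": kwargs, "options": options}


@trainable(fake_submit, workers=1, gpus_per_worker=1)
def fit(lr: float = 3e-4, epochs: int = 10) -> str:
    return f"lr={lr}, epochs={epochs}"


local_result = fit(lr=1e-3)            # runs in this process
submitted = fit.promote(lr=1e-3)       # goes through the submitter
bigger = fit.with_options(workers=4)   # copy with workers overridden
```

Keeping the options on the wrapper rather than in the function is what lets with_options produce variants without redecorating.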

A GPU promote runs your function on a GPU worker, so plain torch code that uses CUDA works as written. For multi-GPU or multi-node training, use Ray Train or ray.remote inside the function as you would in any Ray program.

Every promote is tracked in MLflow with no setup: the run is named after the job (not a random name), CPU/GPU/memory system metrics are captured automatically, and anything you log inside the function — metrics, artifacts, mlflow.pytorch.log_model(...) — lands on that run. Pass experiment="my-experiment" to @pm.train to group runs under a named experiment.

Mounting data

volumes mounts storage into every worker. Each entry is a mapping {"volume": str, "mount_path": str, "read_only": bool}:

Python
run = submit_training(
    entrypoint="python train.py",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
    volumes=[{"volume": "datasets", "mount_path": "/data", "read_only": True}],
)

Explicit client

For a longer session, or to manage credentials yourself, use an explicit Client as a context manager:

Python
from privatemind import Client

with Client(base_url="https://api.privatemind.com", token="PMIND...:abcdef...") as pm:
    run = pm.submit_training(
        entrypoint="python train.py", image="...", workers=4, target_cluster="<your-gpu-cluster>",
    )
    for r in pm.list_jobs():
        print(r.name, r.phase)

Errors are typed (AuthError, ConfigError, ValidationError, NotFoundError, ConflictError, RateLimitError, ServerError), and all derive from a common PrivatemindError, so you can catch one specific failure or the whole family.
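
One way to use typed errors is to retry only the failures that are likely to be transient. The sketch below defines local stand-in classes with the same names so that it runs without the SDK; with the SDK installed you would import the real ones instead. call_with_retry is a hypothetical helper, and treating RateLimitError and ServerError as retryable is an assumption, not documented behaviour.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


# Stand-ins mirroring the error names above, so the sketch is self-contained.
class PrivatemindError(Exception):
    pass


class RateLimitError(PrivatemindError):
    pass


class ServerError(PrivatemindError):
    pass


class ValidationError(PrivatemindError):
    pass


def call_with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying with exponential backoff on transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (RateLimitError, ServerError):
            # Assumption: these two are worth retrying. Any other error,
            # such as ValidationError, propagates on the first failure.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In practice call would be a small function or lambda wrapping a submit_training(...) invocation, with a broader except PrivatemindError around the whole thing for anything that should not be retried.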

Where next

  • Compute & training — what the platform offers and how it fits together.
  • API keys — create the key the SDK authenticates with.