Python SDK · PrivateMind Docs

privatemind-python is the official Python SDK for submitting and managing training jobs. It's a thin client over the training API: notebooks, scripts, and CI all use the same interface.

Text

pip install privatemind-python

Authentication

The SDK resolves credentials in order:

token= passed to Client(...)
the PRIVATEMIND_TOKEN environment variable
~/.privatemind/auth (must be mode 0600)

Use a PrivateMind API key as the token, and set the base URL with base_url= or PRIVATEMIND_URL (https://api.privatemind.com).
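The lookup order above can be sketched in a few lines. This is an illustration, not the SDK's own code: resolve_token is a hypothetical helper, and it assumes the auth file holds a bare token and that a wrong file mode raises PermissionError. The real client may behave differently on both points.

```python
import os
import stat
from pathlib import Path
from typing import Mapping, Optional


def resolve_token(explicit: Optional[str] = None,
                  environ: Optional[Mapping[str, str]] = None,
                  auth_file: Optional[Path] = None) -> Optional[str]:
    """Return the first token found, following the documented lookup order."""
    environ = os.environ if environ is None else environ
    auth_file = auth_file or Path.home() / ".privatemind" / "auth"

    # 1. An explicit token= argument wins.
    if explicit:
        return explicit
    # 2. Then the PRIVATEMIND_TOKEN environment variable.
    if environ.get("PRIVATEMIND_TOKEN"):
        return environ["PRIVATEMIND_TOKEN"]
    # 3. Finally the auth file, accepted only when its mode is exactly 0600.
    if auth_file.is_file():
        mode = stat.S_IMODE(auth_file.stat().st_mode)
        if mode != 0o600:
            # Assumption: the file format and this error type are illustrative.
            raise PermissionError(f"{auth_file} has mode {oct(mode)}, expected 0o600")
        return auth_file.read_text().strip()
    return None
```

If the SDK refuses the auth file, running chmod 600 ~/.privatemind/auth is the usual fix.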

Submit a job

submit_training is the core entry point. It is keyword-only and returns a Run handle.

Python

from privatemind import submit_training

run = submit_training(
    entrypoint="python train.py --epochs 10",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
)

print(run)            # Run(name='tj-abc12', phase='Pending')
print(run.url)        # link to the run in the PrivateMind UI
run.wait()            # block until the job is terminal
print(run.phase, run.mlflow_run_id)

Parameter	Default	Purpose
`entrypoint`	required	The command each worker runs, e.g. `python train.py`.
`image`	required*	Container image the workers run.
`workers`	required	Number of workers.
`target_cluster`	required*	Your org's GPU cluster name (from your dashboard or account team).
`gpus_per_worker`	`0`	GPUs requested per worker.
`gpu_placements`	`None`	Optional explicit GPU pinning; omit it and the platform picks owned, free GPUs for you (see below).
`cpu_per_worker`, `memory_per_worker`	`None`	Per-worker CPU and memory requests.
`env`	`None`	Environment variables for the job.
`mlflow`	`True`	Track the run in MLflow so `run.mlflow_run_id` populates.
`volumes`	`None`	Volumes to mount into every worker (see below).
`ttl_seconds_after_finished`	`None`	Auto-clean the job this long after it finishes.
`name`	generated	Job name; defaults to a generated short name.

* Inside a GPU workspace, image and target_cluster default from the environment and can be omitted.

GPUs

Set gpus_per_worker and the platform pins real GPUs for you. You do not need to know which physical GPUs your org owns: when you leave gpu_placements unset, the platform auto-derives a placement from the GPUs your org owns that are free right now. To pin exact devices, pass one placement per GPU worker, each with exactly gpus_per_worker indices:

Python

gpu_placements=[{"host": "gpu-node-1", "indices": [0]}]

For multi-node training, set workers greater than 1: each worker gets gpus_per_worker GPUs, and the job can spread across machines (one placement per worker). Leave gpu_placements unset and the platform packs workers onto the fewest hosts possible: a job that fits on one machine stays there, and only larger jobs span multiple hosts. A job that asks for GPUs your org does not own, or GPUs that conflict with another running job, is rejected at submit.
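When pinning devices by hand, it can help to check the shape of gpu_placements locally before submitting. The sketch below is a hypothetical client-side helper, not part of the SDK. It checks the two documented rules (one placement per worker, exactly gpus_per_worker indices each) and adds a duplicate-GPU check as an extra sanity test. Host names are placeholders, and ownership and conflicts with other jobs can only be checked by the platform.

```python
from typing import Dict, List


def check_placements(placements: List[Dict], workers: int, gpus_per_worker: int) -> None:
    """Raise ValueError if placements break the documented shape rules."""
    # Rule 1: one placement per worker.
    if len(placements) != workers:
        raise ValueError(f"expected {workers} placements, got {len(placements)}")
    seen = set()
    for p in placements:
        # Rule 2: each placement lists exactly gpus_per_worker indices.
        if len(p["indices"]) != gpus_per_worker:
            raise ValueError(f"{p['host']}: expected {gpus_per_worker} indices")
        # Extra sanity check (an assumption, not a documented rule):
        # the same physical GPU should not appear twice in one job.
        for idx in p["indices"]:
            key = (p["host"], idx)
            if key in seen:
                raise ValueError(f"GPU {idx} on {p['host']} is listed twice")
            seen.add(key)


# Two workers with two GPUs each, on two hypothetical hosts.
placements = [
    {"host": "gpu-node-1", "indices": [0, 1]},
    {"host": "gpu-node-2", "indices": [0, 1]},
]
check_placements(placements, workers=2, gpus_per_worker=2)
```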

The Run handle

submit_training returns a Run, a live handle to the job:

phase is the lifecycle state: Pending, Running, then a terminal state (Succeeded, Failed, Cancelled, Aborted, or Unknown).
url links to the run in the UI; mlflow_run_id links it to MLflow.
start_time and end_time are the job's wall-clock bounds.
refresh() re-reads state, wait(timeout=None) blocks until the job is terminal, and cancel() stops a running job.
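run.wait() is usually all you need. If you want to report progress while waiting, a polling loop over refresh() and phase is one option. The sketch below assumes only what is described above (a refresh() method, a phase attribute, and the listed terminal phases). wait_until_terminal and its parameters are hypothetical names, not SDK functions.

```python
import time

# Terminal phases as listed above.
TERMINAL_PHASES = {"Succeeded", "Failed", "Cancelled", "Aborted", "Unknown"}


def wait_until_terminal(run, poll_seconds=10.0, timeout=None, on_change=print):
    """Poll run.refresh() until run.phase is terminal, reporting each phase change."""
    deadline = None if timeout is None else time.monotonic() + timeout
    last = None
    while True:
        run.refresh()
        if run.phase != last:
            last = run.phase
            on_change(last)  # called once per distinct phase
        if last in TERMINAL_PHASES:
            return last
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"run still {last} after {timeout}s")
        time.sleep(poll_seconds)
```

Called as wait_until_terminal(run, poll_seconds=30), this prints each phase once and returns the terminal one.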

List or look up jobs without holding the original handle:

Python

from privatemind import list_jobs, get_job

for r in list_jobs():
    print(r.name, r.phase)

get_job("tj-abc12").cancel()

Completion webhooks

Instead of polling with run.wait(), you can have the platform POST to a URL of yours when a job succeeds or fails. Pass webhook_url (and, to sign the callback, webhook_secret) on submit:

Python

run = submit_training(
    entrypoint="python train.py",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    webhook_url="https://hooks.acme.com/pmind/training",
    webhook_secret="whsec_...",   # optional; enables signing
)

webhook_url must be https. Before it works, your org admin has to register the host as an approved egress destination (see the prerequisite below); until then, submit is rejected.

When the job succeeds or fails, the platform POSTs the result to your URL. Delivery is at-least-once: a retry can repeat a delivery, so treat delivery_id as an idempotency key and dedupe on it.

JSON

{
  "job_id": "tj-abc12",
  "name": "tj-abc12",
  "delivery_id": "9f1c…",
  "status": "Succeeded",
  "started_at": "2026-01-02T10:00:00Z",
  "finished_at": "2026-01-02T10:14:30Z"
}

status is the terminal phase, Succeeded or Failed. Two fields are conditional: a Failed body also carries an error string, and an mlflow_run_id is included only once the run has one assigned (it may be absent in the current release).
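On the receiving side, deduplication and the two conditional fields might be handled along these lines. This is a minimal sketch: handle_delivery is a hypothetical function, and the in-memory set stands in for whatever durable store your service uses.

```python
import json

# Assumption: in-memory only for illustration; a real receiver would persist
# seen delivery IDs so that duplicates are caught across restarts.
_seen_deliveries = set()


def handle_delivery(raw_body: bytes) -> str:
    """Process one webhook body, ignoring repeats of the same delivery_id."""
    event = json.loads(raw_body)
    delivery_id = event["delivery_id"]
    if delivery_id in _seen_deliveries:
        return "duplicate"
    _seen_deliveries.add(delivery_id)

    if event["status"] == "Failed":
        # error is only present on Failed bodies.
        return f"failed: {event.get('error', 'no error given')}"
    # mlflow_run_id may be absent, so read it with .get().
    return f"succeeded (mlflow_run_id={event.get('mlflow_run_id')})"
```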

Verifying the signature

When you set webhook_secret, every delivery carries Standard Webhooks headers:

webhook-id: the delivery_id (stable across retries of the same delivery).
webhook-timestamp: unix seconds.
webhook-signature: v1,<base64 HMAC-SHA256(secret, "{webhook-id}.{webhook-timestamp}.{raw-body}")>.

Verify with any Standard Webhooks library, or directly:

Python

import base64, hashlib, hmac

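def verify(secret, headers, raw_body):
    # raw_body is the request body exactly as received; accept bytes or str.
    if isinstance(raw_body, str):
        raw_body = raw_body.encode()
    # Signed content is "{webhook-id}.{webhook-timestamp}.{raw-body}".
    prefix = f"{headers['webhook-id']}.{headers['webhook-timestamp']}.".encode()
    expected = base64.b64encode(
        hmac.new(secret.encode(), prefix + raw_body, hashlib.sha256).digest()
    ).decode()
    got = headers["webhook-signature"].removeprefix("v1,")  # Python 3.9+
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, got)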

Verify against the raw request body, not a re-serialised copy: reparsing and re-dumping the JSON changes the bytes and breaks the signature. To blunt replays, reject deliveries whose webhook-timestamp differs from the current time by more than a tolerance you choose (a few minutes is a common choice).
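A replay check on webhook-timestamp could look like the following. The five-minute default is an assumption chosen for illustration, not a value the platform prescribes, and timestamp_is_fresh is a hypothetical helper.

```python
import time


def timestamp_is_fresh(webhook_timestamp: str, tolerance_seconds: int = 300, now=None) -> bool:
    """Return True if the header value (unix seconds) is within the tolerance of now."""
    try:
        sent = int(webhook_timestamp)
    except (TypeError, ValueError):
        return False  # a missing or malformed header is treated as stale
    now = time.time() if now is None else now
    # Check both directions: too old (replay) or too far in the future (clock skew).
    return abs(now - sent) <= tolerance_seconds
```

Run this check alongside the signature check. The timestamp is part of the signed content, so an attacker cannot refresh it without invalidating the signature.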

Checking delivery from the SDK

run.webhook_status reports where a delivery stands, as recorded by the server:

Value	Meaning
`not_configured`	no `webhook_url` was set
`disabled`	a webhook is configured but delivery is off in this environment
`pending`	queued or mid-retry
`blocked_egress`	the host is not an approved egress destination for your org; it will not deliver until an admin registers it
`delivered`	the callback was accepted (2xx)
`failed`	retries exhausted
`unknown`	the server could not read the delivery state; re-check shortly
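One way to act on these values in a script is to map each status to a next step. The mapping below is a suggestion based on the meanings in the table. next_step and its return values are hypothetical, not part of the SDK.

```python
def next_step(webhook_status: str) -> str:
    """Map a webhook_status value to a suggested action."""
    if webhook_status in ("pending", "unknown"):
        return "recheck"    # delivery still in flight, or state not readable yet
    if webhook_status == "blocked_egress":
        return "ask_admin"  # the host must be registered before delivery can proceed
    if webhook_status == "failed":
        return "fallback"   # retries exhausted; read run.phase directly instead
    if webhook_status in ("delivered", "not_configured", "disabled"):
        return "done"       # nothing further will happen for this run
    raise ValueError(f"unrecognised webhook_status: {webhook_status!r}")
```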

Prerequisite: the host must be allowlisted

The data tier reaches the internet only through a per-org egress proxy confined to hosts your org has registered, so a webhook target is not a free-form URL. Your org admin registers the webhook host once as an egress destination. A submit that names a host which is not a registered destination is rejected outright (400); nothing is queued. The blocked_egress status is the other case: a host that was registered when you submitted but was removed before the job finished, so the pending delivery parks until it is registered again. Either way it is a one-time onboarding step per host, not per job.
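To catch an obvious mistake before the server returns a 400, a script can pre-check the URL against the hosts your admin has registered. In this sketch the registered_hosts set is something you maintain yourself, exact hostname matching is an assumption about how the platform compares hosts, and check_webhook_url is a hypothetical helper. The server remains the authority.

```python
from urllib.parse import urlsplit


def check_webhook_url(url: str, registered_hosts: set) -> str:
    """Return the URL's host if it passes the local checks, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("webhook_url must use https")
    host = (parts.hostname or "").lower()
    # Assumption: exact match against a locally maintained list of registered hosts.
    if host not in registered_hosts:
        raise ValueError(f"{host!r} is not a registered egress destination")
    return host
```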

Promote a notebook function

@pm.train takes a function you validated interactively and runs it as a distributed, MLflow-tracked job, without leaving Python.

Python

import privatemind as pm

@pm.train(workers=1, gpus_per_worker=1)
def train(lr=3e-4, epochs=10):
    import mlflow, torch       # re-resolved on the worker, must be in the image
    assert torch.cuda.is_available()
    ...

train(lr=1e-3)                 # runs locally in the notebook (validate)
run = train.promote(lr=1e-3)   # packages + submits to the GPU -> Run
run.wait()
print(run.mlflow_run_id)

Calling the function runs it in the notebook; .promote() ships it to the fleet. .with_options(...) returns a copy with tweaked settings without redecorating.

A GPU promote runs your function on a GPU worker, so plain torch code that uses CUDA works as written. For multi-GPU or multi-node training, use Ray Train or ray.remote inside the function as you would in any Ray program.

Every promote is tracked in MLflow with no setup: the run is named after the job (not a random name), CPU/GPU/memory system metrics are captured automatically, and anything you log inside the function (metrics, artifacts, mlflow.pytorch.log_model(...)) lands on that run. Pass experiment="my-experiment" to @pm.train to group runs under a named experiment.

Mounting data

volumes mounts storage into every worker. Each entry is a mapping {"volume": str, "mount_path": str, "read_only": bool}:

Python

run = submit_training(
    entrypoint="python train.py",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
    volumes=[{"volume": "datasets", "mount_path": "/data", "read_only": True}],
)

Explicit client

For a longer session, or to manage credentials yourself, use an explicit Client as a context manager:

Python

from privatemind import Client

with Client(base_url="https://api.privatemind.com", token="PMIND...:abcdef...") as pm:
    run = pm.submit_training(
        entrypoint="python train.py", image="...", workers=4, target_cluster="<your-gpu-cluster>",
    )
    for r in pm.list_jobs():
        print(r.name, r.phase)

Errors are typed (AuthError, ConfigError, ValidationError, NotFoundError, ConflictError, RateLimitError, ServerError), all under a common PrivatemindError, so you can catch precisely.
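Typed errors make it straightforward to retry only the failures that are likely to be transient. The helper below is a generic sketch, not an SDK feature. with_retries and its parameters are hypothetical, and whether the SDK already retries internally is not stated here. With the SDK you might pass retry_on=(RateLimitError, ServerError) and let the other error types propagate.

```python
import time


def with_retries(fn, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff when it raises one of retry_on."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrapping a submit would then look like with_retries(lambda: submit_training(...), retry_on=(RateLimitError, ServerError)), with the exception classes imported from wherever the SDK exposes them.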

Where next

Compute & training: what the platform offers and how it fits together.
API keys: create the key the SDK authenticates with.