privatemind-python is the official Python SDK for submitting and managing training jobs. It's a thin client over the training API: notebooks, scripts, and CI all use the same interface.
pip install privatemind-python
Authentication
The SDK resolves credentials in order:
- token= passed to Client(...)
- the PRIVATEMIND_TOKEN environment variable
- ~/.privatemind/auth (must be mode 0600)
Use a PrivateMind API key as the token, and set the base URL with base_url= or PRIVATEMIND_URL (https://api.privatemind.com).
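The lookup order above can be sketched in a few lines. This is an illustration only, not the SDK's implementation: resolve_token is a hypothetical helper, and two details are assumptions, namely that the auth file holds the bare token as text and that a wrong file mode raises an error rather than being skipped.

```python
import os
import stat
from pathlib import Path
from typing import Optional


def resolve_token(token: Optional[str] = None,
                  env: Optional[dict] = None,
                  auth_file: Optional[Path] = None) -> Optional[str]:
    """Hypothetical helper mirroring the documented credential order."""
    env = os.environ if env is None else env
    if auth_file is None:
        auth_file = Path.home() / ".privatemind" / "auth"

    # 1. An explicit token= argument wins.
    if token:
        return token
    # 2. Then the PRIVATEMIND_TOKEN environment variable.
    if env.get("PRIVATEMIND_TOKEN"):
        return env["PRIVATEMIND_TOKEN"]
    # 3. Finally the auth file, accepted only when it is mode 0600.
    if auth_file.is_file():
        mode = stat.S_IMODE(auth_file.stat().st_mode)
        if mode != 0o600:
            # Assumption: a loose mode is treated as an error.
            raise PermissionError(f"{auth_file} must be mode 0600, found {oct(mode)}")
        # Assumption: the file contains the bare token as text.
        return auth_file.read_text().strip() or None
    return None
```

Passing env and auth_file explicitly keeps the helper easy to test without touching the real environment or home directory.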
Submit a job
submit_training is the core entry point. It is keyword-only and returns a Run handle.
from privatemind import submit_training
run = submit_training(
    entrypoint="python train.py --epochs 10",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
)
print(run) # Run(name='tj-abc12', phase='Pending')
print(run.url) # link to the run in the PrivateMind UI
run.wait() # block until Succeeded or Failed
print(run.phase, run.mlflow_run_id)
| Parameter | Default | Purpose |
|---|---|---|
| entrypoint | required | The command each worker runs, e.g. python train.py. |
| image | required* | Container image the workers run. |
| workers | required | Number of workers. |
| target_cluster | required* | Your org's GPU cluster name (from your dashboard or account team). |
| gpus_per_worker | 0 | GPUs requested per worker. |
| gpu_placements | None | Optional explicit GPU pinning; omit it and the platform picks owned, free GPUs for you (see below). |
| cpu_per_worker, memory_per_worker | None | Per-worker CPU and memory requests. |
| env | None | Environment variables for the job. |
| mlflow | True | Track the run in MLflow so run.mlflow_run_id populates. |
| volumes | None | Volumes to mount into every worker (see below). |
| ttl_seconds_after_finished | None | Auto-clean the job this long after it finishes. |
| name | generated | Job name; defaults to a generated short name. |
* Inside a GPU workspace, image and target_cluster default from the environment and can be omitted.
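Because submit_training is keyword-only, job settings are often easiest to assemble as a dictionary first. The sketch below shows one way to do that; build_job_kwargs is a hypothetical helper, not an SDK function, and it only mirrors the required parameters and defaults listed in the table.

```python
from typing import Any, Optional


def build_job_kwargs(
    *,  # a bare * makes every following parameter keyword-only
    entrypoint: str,
    workers: int,
    image: Optional[str] = None,           # may be omitted inside a GPU workspace
    target_cluster: Optional[str] = None,  # may be omitted inside a GPU workspace
    gpus_per_worker: int = 0,
    mlflow: bool = True,
    **optional: Any,  # gpu_placements, env, volumes, name, ...
) -> dict:
    """Hypothetical helper (not part of the SDK): collect job arguments."""
    kwargs = {
        "entrypoint": entrypoint,
        "workers": workers,
        "image": image,
        "target_cluster": target_cluster,
        "gpus_per_worker": gpus_per_worker,
        "mlflow": mlflow,
        **optional,
    }
    # Drop unset values so the documented defaults apply at submit time.
    return {key: value for key, value in kwargs.items() if value is not None}
```

The resulting dictionary can then be unpacked into the call, as in submit_training(**build_job_kwargs(entrypoint="python train.py", workers=1)). Passing the arguments positionally raises TypeError, which is the same behavior a keyword-only entry point gives you.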
GPUs
Set gpus_per_worker and the platform pins real GPUs for you. You do not need to know which physical GPUs your org owns: when you leave gpu_placements unset, the platform auto-derives a placement from the GPUs your org owns that are free right now. To pin exact devices, pass one placement per GPU worker, each with exactly gpus_per_worker indices:
gpu_placements=[{"host": "gpu-node-1", "indices": [0]}]
For multi-node training, set workers greater than 1: each worker gets gpus_per_worker GPUs and the job spreads across machines (one placement per worker). Leave gpu_placements unset and the platform will pack workers onto the fewest hosts possible. A job that fits on one machine stays there; only larger jobs span multiple hosts. A job that asks for GPUs your org does not own, or for GPUs that conflict with another running job, is rejected at submit time.
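The shape rules for explicit placements (one placement per worker, each with exactly gpus_per_worker indices) can be checked locally before submitting. The following is a minimal sketch under stated assumptions: check_placements is a hypothetical helper, and the duplicate-GPU check is an extra sanity check added here, not a documented rule. Ownership and conflicts with other jobs can only be decided by the platform.

```python
from typing import Optional


def check_placements(workers: int, gpus_per_worker: int,
                     placements: Optional[list]) -> None:
    """Hypothetical client-side pre-check; raises ValueError on a bad shape."""
    if placements is None:
        return  # unset: the platform derives a placement itself

    # One placement per worker.
    if len(placements) != workers:
        raise ValueError(f"expected {workers} placements, got {len(placements)}")

    seen = set()
    for position, placement in enumerate(placements):
        host, indices = placement["host"], placement["indices"]
        # Each placement lists exactly gpus_per_worker GPU indices.
        if len(indices) != gpus_per_worker:
            raise ValueError(
                f"placement {position} has {len(indices)} indices, "
                f"expected {gpus_per_worker}"
            )
        # Assumption: the same physical GPU should not be listed twice.
        for index in indices:
            if (host, index) in seen:
                raise ValueError(f"GPU {index} on {host} is listed twice")
            seen.add((host, index))
```

For example, check_placements(2, 1, [{"host": "gpu-node-1", "indices": [0]}, {"host": "gpu-node-1", "indices": [1]}]) passes, while giving both workers index 0 on the same host raises ValueError.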
The Run handle
submit_training returns a Run, a live handle to the job:
- phase is the lifecycle state (Pending, Running, Succeeded, Failed).
- url links to the run in the UI; mlflow_run_id links it to MLflow.
- start_time and end_time are the job's wall-clock bounds.
- refresh() re-reads state, wait(timeout=None) blocks until the job is terminal, and cancel() stops a running job.
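If you need custom behavior while waiting, such as printing progress or using your own timeout policy, a polling loop over refresh() and phase is enough. The sketch below is not how wait() is implemented; wait_for, its parameters, and the choice to raise TimeoutError are assumptions, and FakeRun is a stand-in so the example runs without the SDK.

```python
import time

TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for(run, timeout=None, poll_seconds=5.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll run.refresh() until run.phase is terminal; return the final phase."""
    deadline = None if timeout is None else clock() + timeout
    while True:
        run.refresh()
        if run.phase in TERMINAL_PHASES:
            return run.phase
        if deadline is not None and clock() >= deadline:
            # Assumption: this sketch signals a timeout with TimeoutError.
            raise TimeoutError(f"run is still {run.phase} after {timeout}s")
        sleep(poll_seconds)


class FakeRun:
    """Stand-in exposing only phase and refresh(), for demonstration."""

    def __init__(self, phases):
        self._phases = iter(phases)
        self.phase = "Pending"

    def refresh(self):
        # Advance to the next scripted phase, or stay put when exhausted.
        self.phase = next(self._phases, self.phase)


final = wait_for(FakeRun(["Pending", "Running", "Succeeded"]), sleep=lambda s: None)
print(final)
```

With a real Run, the same function would be called as wait_for(run, timeout=3600).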
List or look up jobs without holding the original handle:
from privatemind import list_jobs, get_job
for r in list_jobs():
    print(r.name, r.phase)
get_job("tj-abc12").cancel()
Promote a notebook function
@pm.train takes a function you validated interactively and runs it as a distributed, MLflow-tracked job, without leaving Python.
import privatemind as pm
@pm.train(workers=1, gpus_per_worker=1)
def train(lr=3e-4, epochs=10):
    import mlflow, torch  # re-resolved on the worker, must be in the image
    assert torch.cuda.is_available()
    ...
train(lr=1e-3)  # runs locally in the notebook (validate)
run = train.promote(lr=1e-3)  # packages + submits to the GPU -> Run
run.wait()
print(run.mlflow_run_id)
Calling the function runs it in the notebook; .promote() ships it to the fleet. .with_options(...) returns a copy with tweaked settings without redecorating.
A GPU promote runs your function on a GPU worker, so plain torch code that uses CUDA works as written. For multi-GPU or multi-node training, use Ray Train or ray.remote inside the function as you would in any Ray program.
Every promote is tracked in MLflow with no setup: the run is named after the job (not a random name), CPU/GPU/memory system metrics are captured automatically, and anything you log inside the function — metrics, artifacts, mlflow.pytorch.log_model(...) — lands on that run. Pass experiment="my-experiment" to @pm.train to group runs under a named experiment.
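To make the call / promote / with_options split concrete, here is a toy model of the pattern. It is not the SDK's decorator: Promotable and this train are hypothetical names, and promote() here only returns a description of what would be submitted instead of packaging the function or contacting any API.

```python
import functools


class Promotable:
    """Toy wrapper: callable locally, with promote() and with_options()."""

    def __init__(self, fn, **options):
        self.fn = fn
        self.options = options
        functools.update_wrapper(self, fn)  # keep the function's name and docstring

    def __call__(self, *args, **kwargs):
        # A plain call runs the function in the current process.
        return self.fn(*args, **kwargs)

    def with_options(self, **overrides):
        # Return a new wrapper with merged settings; the original is unchanged.
        return Promotable(self.fn, **{**self.options, **overrides})

    def promote(self, **kwargs):
        # Stand-in for submission: describe the job rather than send it.
        return {"function": self.fn.__name__, "options": self.options, "arguments": kwargs}


def train(**options):
    """Decorator factory, loosely shaped like @pm.train(...)."""
    return lambda fn: Promotable(fn, **options)


@train(workers=1, gpus_per_worker=1)
def fit(lr=3e-4, epochs=10):
    return lr * epochs


local_result = fit(lr=1.0, epochs=2)      # runs locally
bigger = fit.with_options(workers=4)      # copy with one setting changed
job = bigger.promote(lr=1e-3)             # description of the would-be job
```

The point of the design is that the decorated name stays an ordinary callable, so validating in a notebook and shipping to the fleet use the same function body.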
Mounting data
volumes mounts storage into every worker. Each entry is a mapping {"volume": str, "mount_path": str, "read_only": bool}:
run = submit_training(
    entrypoint="python train.py",
    image="ghcr.io/acme/trainer:1.2.3",
    workers=1,
    target_cluster="<your-gpu-cluster>",
    gpus_per_worker=1,
    volumes=[{"volume": "datasets", "mount_path": "/data", "read_only": True}],
)
Explicit client
For a longer session, or to manage credentials yourself, use an explicit Client as a context manager:
from privatemind import Client
with Client(base_url="https://api.privatemind.com", token="PMIND...:abcdef...") as pm:
    run = pm.submit_training(
        entrypoint="python train.py", image="...", workers=4, target_cluster="<your-gpu-cluster>",
    )
    for r in pm.list_jobs():
        print(r.name, r.phase)
Errors are typed (AuthError, ConfigError, ValidationError, NotFoundError, ConflictError, RateLimitError, ServerError), all under a common PrivatemindError, so you can catch exactly the failures you want to handle.
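Typed errors make it straightforward to retry some failures and let others surface immediately. The sketch below defines local stand-in classes with the documented names so it runs without the SDK; in real code you would import them from the package instead. Treating RateLimitError and ServerError as retryable is an assumption of this example, and with_retries is a hypothetical helper.

```python
import time


# Stand-ins mirroring the documented error names (not the SDK's classes).
class PrivatemindError(Exception): ...
class RateLimitError(PrivatemindError): ...
class ServerError(PrivatemindError): ...
class ValidationError(PrivatemindError): ...


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry failures assumed transient, with exponential backoff.

    Any other exception, including other PrivatemindError subclasses,
    propagates on the first occurrence.
    """
    for attempt in range(attempts):
        try:
            return call()
        except (RateLimitError, ServerError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A call such as with_retries(lambda: submit_training(entrypoint="python train.py", workers=1)) would then retry rate limits and server errors while a ValidationError still fails fast.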
Where next
- Compute & training — what the platform offers and how it fits together.
- API keys — create the key the SDK authenticates with.