If your organization has a GPU allocation, you can run your own inference servers on it. Each deployment is an OpenAI-compatible vLLM server pinned to GPUs you own, serving a model from the curated catalog or from your own weights. Once registered, it is callable through the same API as any built-in model.
Launch a deployment
Go to Settings → Compute → Inference deployments (vLLM) and start a new deployment. You choose:
- Model source — a catalog model (picked from the curated catalog available to your org) or a custom volume (your own weights on a storage volume you created on the Compute page, with an optional subpath to the model directory).
- Host — which of your GPU hosts to run on.
- GPUs — select the specific GPUs, or MIG slices, to pin the server to. GPUs already in use are shown; a partially used GPU can be shared up to a memory cap.
- vLLM version — from the supported list.
Under Advanced, you can set common vLLM parameters (max-model-len, gpu-memory-utilization, tensor-parallel-size, dtype, and so on) and add any other vLLM flag as a custom key/value pair.
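To make the key/value idea concrete, here is a minimal sketch of how such parameters could translate into a `vllm serve` command line. The helper `build_vllm_args` is hypothetical and written only for illustration; how the platform actually passes parameters to vLLM is not documented here. The flag names themselves (`--max-model-len`, `--gpu-memory-utilization`, `--tensor-parallel-size`, `--dtype`, `--enable-prefix-caching`) are standard vLLM options, and the model path is a placeholder.

```python
from typing import Mapping, Union

FlagValue = Union[str, int, float, bool]


def build_vllm_args(model: str, params: Mapping[str, FlagValue]) -> list[str]:
    """Turn key/value parameters into a `vllm serve` argument list (illustrative only)."""
    util = params.get("gpu-memory-utilization")
    # vLLM treats this value as the fraction of GPU memory one server may use.
    if util is not None and not (0 < float(util) <= 1):
        raise ValueError("gpu-memory-utilization must be in (0, 1]")

    args = ["vllm", "serve", model]
    for key, value in params.items():
        # Accept either max_model_len or max-model-len spellings.
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options are bare switches: present when true, omitted when false.
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


# Hypothetical values; "/models/my-model" is a placeholder path.
example = build_vllm_args(
    "/models/my-model",
    {
        "max-model-len": 8192,
        "gpu-memory-utilization": 0.45,
        "tensor-parallel-size": 2,
        "dtype": "bfloat16",
        "enable-prefix-caching": True,  # a custom flag added as a key/value pair
    },
)
print(" ".join(example))
```

When two deployments share one GPU, their `gpu-memory-utilization` fractions need to leave room for each other, which is presumably what the memory cap on a partially used GPU enforces.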
Make it callable
A new deployment starts up (large models can take a few minutes) and shows its status; if it fails to start, the last error is surfaced. A deployment is not reachable through the API until you register it:
- Register the deployment and give it a public model name. That name becomes a model id you can pass to the chat completions API and any OpenAI-compatible client, exactly like a built-in model.
- Rename changes the public model name; unregister removes the deployment from the API without deleting the server.
- Drop stops the server, releases its GPUs, and unregisters the deployment.
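Because a deployment only becomes a valid model id after registration, a script may want to wait until the public name is live before sending requests. The sketch below polls a model listing until the name appears. The helper `wait_until_registered` is hypothetical, and the commented usage assumes the API exposes the OpenAI-compatible model listing that `client.models.list()` calls; treat both as assumptions rather than documented behavior.

```python
import time
from typing import Callable, Iterable


def wait_until_registered(
    list_model_ids: Callable[[], Iterable[str]],
    public_name: str,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until public_name appears among the listed model ids, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if public_name in set(list_model_ids()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Usage with the OpenAI client (assumes the model listing endpoint is available):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="https://api.privatemind.com/v1", api_key="...")
#   live = wait_until_registered(
#       lambda: (m.id for m in client.models.list()),
#       "<your-public-model-name>",
#   )
```

Passing the listing function in as a parameter keeps the helper independent of any particular client. The same check also works after a rename or unregister, since the old name should then no longer be listed.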
Call your model
Once registered, your model is just another id in the catalog:
```python
from openai import OpenAI

# Point the standard OpenAI client at the API base URL.
client = OpenAI(base_url="https://api.privatemind.com/v1", api_key="PMIND...:abcdef...")

resp = client.chat.completions.create(
    model="<your-public-model-name>",  # the public model name you registered
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
Where next
- Models — list what is live in your org, including your registered deployments.
- Chat completions — the API your deployed model serves.
- Compute & training — the rest of what a GPU allocation gives you.