Deploy a model · PrivateMind Docs

If your organization has a GPU allocation, you can run your own inference servers on it. Each deployment is an OpenAI-compatible vLLM server pinned to GPUs you own, serving a model from the curated catalog or from your own weights. Once registered, it is callable through the same API as any built-in model.

Launch a deployment

Go to Settings → Compute → Inference deployments (vLLM) and start a new deployment. You choose:

Model source: a catalog model (pick from the curated catalog available to your org) or a custom volume (your own weights on a storage volume you created on the Compute page, with an optional subpath to the model directory).
Host: which of your GPU hosts to run on.
GPUs: select the specific GPUs, or MIG slices, to pin the server to. GPUs already in use are shown; a partially used GPU can be shared up to a memory cap.
vLLM version: from the supported list.

Under Advanced, you can set common vLLM parameters (max-model-len, gpu-memory-utilization, tensor-parallel-size, dtype, and so on) and add any other vLLM flag as a custom key/value pair.
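As a rough illustration of how these key/value pairs correspond to vLLM server flags, the sketch below renders a dictionary of parameters into command-line arguments. The helper `to_vllm_args` and the values shown are hypothetical and chosen only for illustration. How PrivateMind actually passes the parameters to the server is not specified here.

```python
def to_vllm_args(params: dict[str, object]) -> list[str]:
    """Render key/value pairs as vLLM-style command-line flags (illustrative helper)."""
    args: list[str] = []
    for key, value in params.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            # Boolean switches are passed as a bare flag when enabled.
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


# Illustrative values only; size them for your model and GPUs.
advanced = {
    "max-model-len": 8192,
    "gpu-memory-utilization": 0.9,
    "tensor-parallel-size": 2,
    "dtype": "bfloat16",
}

print(" ".join(to_vllm_args(advanced)))
```

Keeping the parameters as plain key/value pairs mirrors the custom-flag form under Advanced, so any other vLLM flag can be added the same way.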

Make it callable

A new deployment takes a moment to start (large models can take a few minutes). While it starts, the page shows its status, and if startup fails, the last error. A deployment is not reachable through the API until you register it:

Register the deployment and give it a public model name. That name becomes a model id you can pass to the chat completions API and any OpenAI-compatible client, exactly like a built-in model.
Rename changes the public model name. Unregister removes the model from the API without deleting the server.
Drop stops the server, releases the GPUs, and unregisters it.
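The difference between these actions can be pictured with a small state model. This is a toy sketch, not a PrivateMind API: the `Deployment` class and its methods are hypothetical names that mirror the actions above, and the rule that a deployment must be registered before it can be renamed is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    """Toy model of a deployment's state; not a PrivateMind API."""

    running: bool = True               # server is up and holding its GPUs
    public_name: Optional[str] = None  # None means not exposed on the API

    @property
    def callable_via_api(self) -> bool:
        return self.running and self.public_name is not None

    def register(self, name: str) -> None:
        self.public_name = name

    def rename(self, name: str) -> None:
        # Assumption: only a registered deployment has a name to change.
        if self.public_name is None:
            raise ValueError("register the deployment before renaming it")
        self.public_name = name

    def unregister(self) -> None:
        self.public_name = None  # the server keeps running

    def drop(self) -> None:
        self.running = False     # stop the server and release the GPUs
        self.public_name = None  # and remove it from the API


d = Deployment()
d.register("my-model")
d.unregister()
print(d.running, d.callable_via_api)
```

After unregister, the server in this model is still running but no longer callable. After drop, it is neither.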

Call your model

Once registered, your model is called like any other model id:

Python

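from openai import OpenAI

# Point the OpenAI SDK at the PrivateMind endpoint and authenticate with your API key.
client = OpenAI(base_url="https://api.privatemind.com/v1", api_key="PMIND...:abcdef...")

# Use the public model name you chose when registering the deployment.
resp = client.chat.completions.create(
    model="<your-public-model-name>",
    messages=[{"role": "user", "content": "Hello"}],
)

# Print the text of the first (and here only) completion.
print(resp.choices[0].message.content)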
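Before sending requests, it can help to confirm that the public model name is actually live. The sketch below assumes the endpoint supports the OpenAI-style model listing that the SDK reaches through `client.models.list()`. The helper name `is_registered` is hypothetical.

```python
def is_registered(client, model_name: str) -> bool:
    """Return True if model_name appears in the client's model list."""
    # client.models.list() is the OpenAI SDK call for listing models;
    # iterating the result yields objects with an `id` attribute.
    return any(model.id == model_name for model in client.models.list())
```

With the `client` created above, `is_registered(client, "<your-public-model-name>")` should return True once the deployment is registered and False after it is unregistered or dropped.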

Where next

Models: list what is live in your org, including your registered deployments.
Chat completions: the API your deployed model serves.
Compute & training: the rest of what a GPU allocation gives you.