Deploy a model

Run your own OpenAI-compatible model server on your GPUs, from the catalog or your own weights.

If your organization has a GPU allocation, you can run your own inference servers on it. Each deployment is an OpenAI-compatible vLLM server pinned to GPUs you own, serving a model from the curated catalog or from your own weights. Once registered, it is callable through the same API as any built-in model.

Launch a deployment

Go to Settings → Compute → Inference deployments (vLLM) and start a new deployment. You choose:

  • Model source — either a catalog model (picked from the curated catalog available to your org) or a custom volume (your own weights on a storage volume you created on the Compute page, with an optional subpath to the model directory).
  • Host — which of your GPU hosts to run on.
  • GPUs — select the specific GPUs, or MIG slices, to pin the server to. GPUs already in use are shown; a partially used GPU can be shared up to a memory cap.
  • vLLM version — from the supported list.

Under Advanced, you can set common vLLM parameters (max-model-len, gpu-memory-utilization, tensor-parallel-size, dtype, and so on) and add any other vLLM flag as a custom key/value pair.
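
To make the relationship between these settings and vLLM itself concrete, the sketch below turns a key/value map into a `vllm serve` argument list. The flag names are standard vLLM options; the `build_vllm_args` helper, the model path, the example values, and the idea that the platform assembles its command this way are assumptions made purely for illustration.

```python
def build_vllm_args(model_path, params):
    """Turn a key/value map into a `vllm serve` argument list (hypothetical helper)."""
    args = ["vllm", "serve", model_path]
    for key, value in params.items():
        flag = "--" + key
        if value is True:
            # Boolean switches are passed as a bare flag with no value.
            args.append(flag)
        elif value is False or value is None:
            # Disabled or unset options are simply left out.
            continue
        else:
            args.extend([flag, str(value)])
    return args


# Example values only; pick settings that fit your model and GPUs.
advanced = {
    "max-model-len": 8192,
    "gpu-memory-utilization": 0.85,
    "tensor-parallel-size": 2,
    "dtype": "bfloat16",
    "enable-prefix-caching": True,  # stands in for a custom key/value pair
}

print(" ".join(build_vllm_args("/models/my-model", advanced)))
```

Reading the result as a command line can help when checking a custom flag against the vLLM documentation for the version you selected.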

Make it callable

A new deployment takes time to start (large models can take a few minutes). Its status is shown while it comes up, and if it fails to start, the last error is surfaced. A deployment is not reachable through the API until you register it:

  • Register the deployment and give it a public model name. That name becomes a model id you can pass to the chat completions API and any OpenAI-compatible client, exactly like a built-in model.
  • Rename changes the public model name; unregister removes it from the API without deleting the server.
  • Drop stops the server, releases the GPUs, and unregisters it.
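
One way to confirm that a register, rename, or unregister action has taken effect is to look for the public model name in the model list. The sketch below assumes that registered deployments appear in the list returned by the OpenAI client's `models.list()` call against this API; the `is_registered` helper is hypothetical, not part of any SDK.

```python
def is_registered(client, public_name):
    """Return True if `public_name` is among the model ids the API lists.

    `client` is expected to be an OpenAI-compatible client whose
    `models.list()` yields objects with an `id` attribute.
    """
    return any(model.id == public_name for model in client.models.list())
```

With an `OpenAI` client configured as in the Call your model section, `is_registered(client, "<your-public-model-name>")` would be expected to return `True` after registering and `False` after unregistering or dropping the deployment.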

Call your model

Once registered, your model is called by its public model name, like any other model id:

Python
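from openai import OpenAI

# Point the standard OpenAI client at the platform's API, using your API key.
client = OpenAI(base_url="https://api.privatemind.com/v1", api_key="PMIND...:abcdef...")

# Use the public model name you gave the deployment when registering it.
resp = client.chat.completions.create(
    model="<your-public-model-name>",
    messages=[{"role": "user", "content": "Hello"}],
)
# Print the text of the first (and here only) completion.
print(resp.choices[0].message.content)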
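
For longer replies it can be useful to print text as it is generated. The sketch below wraps the same chat completions call with `stream=True`, which is part of the OpenAI client API; that your deployment accepts streaming requests is an assumption, and `stream_reply` is a hypothetical helper name.

```python
def stream_reply(client, model, prompt):
    """Yield text fragments as they arrive from a streaming chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # assumption: the deployment supports streaming responses
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example role-only
        # or final bookkeeping chunks), so skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

With the `client` from the example above, `for piece in stream_reply(client, "<your-public-model-name>", "Hello"): print(piece, end="", flush=True)` would print the reply incrementally.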

Where next

  • Models — list what is live in your org, including your registered deployments.
  • Chat completions — the API your deployed model serves.
  • Compute & training — the rest of what a GPU allocation gives you.