Chat completions

POST /v1/chat/completions is the primary inference endpoint. Requests and responses follow the OpenAI shape. Click Try it on the cURL tab to run the request against the demo API.

cURL
curl -s "https://api.privatemind.com/v1/chat/completions" \
  -H "Authorization: Bearer $PMIND_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fast",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user",   "content": "Explain the CAP theorem."}
    ],
    "temperature": 0.2,
    "max_tokens": 400
  }'

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.privatemind.com/v1",
    api_key="PMIND...:abcdef...",
)

resp = client.chat.completions.create(
    model="fast",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user",   "content": "Explain the CAP theorem."},
    ],
    temperature=0.2,
    max_tokens=400,
)
print(resp.choices[0].message.content)

JavaScript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.privatemind.com/v1',
  apiKey: process.env.PMIND_KEY,
});

const resp = await client.chat.completions.create({
  model: 'fast',
  messages: [
    { role: 'system', content: 'You are a concise assistant.' },
    { role: 'user',   content: 'Explain the CAP theorem.' },
  ],
  temperature: 0.2,
  max_tokens: 400,
});
console.log(resp.choices[0].message.content);

Parameters

  • model (required): id of a model returned by GET /v1/models.
  • messages (required): conversation so far. Each entry has a role (system, user, assistant, or tool) and a content string. For multimodal inputs, content may be an array. See Vision.
  • temperature (default 1.0): sampling temperature, 0 to 2. Lower is more deterministic.
  • top_p (default 1.0): nucleus sampling cutoff.
  • max_tokens: maximum tokens to generate. The hard ceiling is the model's context_length minus the number of tokens in the prompt.
  • stream (default false): when true, response is delivered as SSE chunks. See Streaming.
  • stop: up to four stop sequences.
  • tools: function definitions the model may call. See Tool use.
  • reasoning_effort: "off", "low", "medium", or "high". Switches thinking on/off on hybrid models. See Reasoning effort below for the full behaviour matrix.
  • metadata: an OpenAI-shape map of up to 16 string→string tags (keys ≤64 chars, values ≤512) for attributing the call. Stored on the usage row and never forwarded to the model; query or roll up by it later. See Usage → Tagging requests.
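
The limits on metadata and stop listed above can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the helper names validate_metadata and validate_stop are hypothetical, and it assumes "chars" means Python string length (the server may count differently).

```python
def validate_metadata(metadata: dict) -> dict:
    """Check a metadata map against the documented limits; return it unchanged if valid."""
    if len(metadata) > 16:
        raise ValueError("metadata allows at most 16 tags")
    for key, value in metadata.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("metadata keys and values must both be strings")
        # Assumption: limits are measured with len(), i.e. Unicode code points.
        if len(key) > 64:
            raise ValueError("metadata key exceeds 64 characters")
        if len(value) > 512:
            raise ValueError("metadata value exceeds 512 characters")
    return metadata


def validate_stop(stop: list) -> list:
    """Check that no more than four stop sequences are supplied."""
    if len(stop) > 4:
        raise ValueError("stop allows at most four sequences")
    return stop


# Example payload; the tag names "team" and "env" are illustrative only.
payload = {
    "model": "fast",
    "messages": [{"role": "user", "content": "Explain the CAP theorem."}],
    "stop": validate_stop(["\n\n"]),
    "metadata": validate_metadata({"team": "search", "env": "staging"}),
}
```

Failing early on the client avoids a round trip for requests that would presumably be rejected anyway.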

Additional OpenAI fields (presence_penalty, frequency_penalty, response_format, seed) are forwarded to the engine. Whether a field is honoured depends on the model: check supported_parameters in GET /v1/models.
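
One way to act on supported_parameters is to separate the optional fields a model advertises from those it does not, before building the request. This sketch assumes supported_parameters is a list of parameter-name strings on each model entry; that shape is not confirmed on this page, and the helper name split_supported is hypothetical.

```python
# The pass-through fields named above.
OPTIONAL_FIELDS = ("presence_penalty", "frequency_penalty", "response_format", "seed")


def split_supported(params: dict, supported_parameters: list) -> tuple:
    """Return (kept, dropped): params safe to send, and optional fields the model does not list."""
    supported = set(supported_parameters)
    kept = {
        name: value
        for name, value in params.items()
        # Core parameters always pass; optional ones only if the model lists them.
        if name not in OPTIONAL_FIELDS or name in supported
    }
    dropped = sorted(
        name for name in params if name in OPTIONAL_FIELDS and name not in supported
    )
    return kept, dropped


kept, dropped = split_supported(
    {"temperature": 0.2, "seed": 7, "presence_penalty": 0.5},
    ["temperature", "seed"],  # assumed value of supported_parameters for some model
)
```

Returning the dropped names lets the caller log or surface them instead of sending fields the model may silently ignore.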

Reasoning effort

reasoning_effort (optional) is the standard way to switch a model's thinking on or off without passing raw chat_template_kwargs. Allowed values:

Value    | Meaning
"off"    | Disable thinking (hybrid models only).
"low"    | Enable thinking with the lightest budget the model exposes.
"medium" | Enable thinking with a moderate budget.
"high"   | Enable thinking with the largest budget the model exposes.

What the API does with the value depends on the model's runtime mode (see Models):

Runtime mode  | Example models                                                 | "off"            | "low" / "medium" / "high"
Hybrid        | Qwen 3.5, DeepSeek V4 Pro, Kimi K2.6, Mistral Medium, Nemotron | Disables thinking | Enables thinking
Thinking-only | Qwen 3 VL Thinking                                             | 400              | No-op (model always thinks)
Non-thinking  | Gemma family                                                   | 400              | 400

cURL
curl -s "https://api.privatemind.com/v1/chat/completions" \
  -H "Authorization: Bearer $PMIND_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "reasoning",
    "messages": [{"role": "user", "content": "Plan a four-step proof of Pythagoras."}],
    "reasoning_effort": "low"
  }'
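
The behaviour matrix above can be read as a small lookup. The sketch below only restates that table so a client can predict the outcome before sending a request; the mode strings ("hybrid", "thinking-only", "non-thinking") and the returned labels are hypothetical names chosen for illustration, not values the API returns.

```python
EFFORTS = ("off", "low", "medium", "high")


def reasoning_outcome(runtime_mode: str, effort: str) -> str:
    """Predict what reasoning_effort does for a given runtime mode, per the matrix."""
    if effort not in EFFORTS:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    if runtime_mode == "hybrid":
        return "thinking disabled" if effort == "off" else "thinking enabled"
    if runtime_mode == "thinking-only":
        # "off" is rejected; any other value changes nothing because the model always thinks.
        return "400" if effort == "off" else "no-op"
    if runtime_mode == "non-thinking":
        # Every value is rejected on models without a thinking mode.
        return "400"
    raise ValueError(f"unknown runtime mode: {runtime_mode!r}")
```

A client could use such a check to omit reasoning_effort for non-thinking models rather than handle the 400 afterwards.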

Response

JSON
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "<model-id>",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "The CAP theorem ..." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 128,
    "total_tokens": 170
  }
}

usage.prompt_tokens and usage.completion_tokens are what your key is billed against.
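
Because billing follows these two fields, a client can keep a running total across calls. A minimal sketch, assuming each response has been parsed into a dict with the usage shape shown above; the function name billable_tokens is hypothetical.

```python
def billable_tokens(responses: list) -> dict:
    """Sum prompt and completion tokens over a list of parsed response bodies."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for resp in responses:
        usage = resp.get("usage", {})
        for field in totals:
            # A response without a usage block contributes nothing to the totals.
            totals[field] += usage.get(field, 0)
    return totals


sample = {"usage": {"prompt_tokens": 42, "completion_tokens": 128, "total_tokens": 170}}
totals = billable_tokens([sample, sample])
```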

Finish reasons

Value          | Meaning
stop           | Natural end of generation or a stop sequence matched
length         | Hit max_tokens or the context window
tool_calls     | The model wants to call one or more tools. See Tool use
content_filter | Output was blocked by a safety filter
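
Client code usually branches on finish_reason. The sketch below maps each value in the table to a next step; the action labels ("done", "truncated", "run_tools", "blocked") and the function name next_action are hypothetical, and the choice is assumed to be a dict parsed from the JSON response.

```python
def next_action(choice: dict) -> str:
    """Map a choice's finish_reason to a client-side action label."""
    reason = choice.get("finish_reason")
    if reason == "stop":
        return "done"        # complete answer, safe to use as-is
    if reason == "length":
        return "truncated"   # cut off by max_tokens or the context window
    if reason == "tool_calls":
        return "run_tools"   # execute the requested tools, then continue the conversation
    if reason == "content_filter":
        return "blocked"     # output was withheld by a safety filter
    return "unknown"         # value not covered by the table


action = next_action({"index": 0, "finish_reason": "length"})
```

Treating "length" separately from "stop" matters because a truncated message can look complete at a glance.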

Where next

  • Streaming for the SSE variant of this endpoint.
  • Tool use for function calling on top of chat completions.
  • Vision for the multimodal content-array shape.
  • Models for capability discovery against your org's catalogue.