Rate limits & budgets · PrivateMind Docs

Every PrivateMind API key carries two independent controls: a budget cap (monthly spend) and a rate limit (calls per minute). Both are enforced server-side.

Budgets

Each key has a USD budget set at creation. Each request's prompt_tokens and completion_tokens are priced at the model's rates, and the resulting cost is added to the key's spend for the current month. Spend resets on the first day of each calendar month (UTC). Once spent_usd >= budget_usd, calls return:

Text

HTTP/1.1 402 Payment Required

JSON

{
  "error": {
    "message": "Key budget exhausted",
    "type": "billing_error",
    "code": "budget_exceeded"
  }
}

To recover, raise the cap in Settings → API Keys, rotate to a key with budget left, or wait for the monthly reset.
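Because retrying does not clear a 402, a client may want to recognise this response and stop instead of backing off. The following is a minimal sketch that checks only the status and the error.code field shown above; the helper name is_budget_exhausted is hypothetical and not part of any SDK.

```python
import json


def is_budget_exhausted(status: int, body: str) -> bool:
    """Return True if a raw HTTP response looks like the 402 budget error above."""
    if status != 402:
        return False
    try:
        # Walk the documented envelope: {"error": {"code": ...}}
        code = json.loads(body).get("error", {}).get("code")
    except (json.JSONDecodeError, AttributeError):
        # Body was not JSON, or not shaped like the error envelope.
        return False
    return code == "budget_exceeded"
```

A caller could use this to fail fast and alert whoever owns the key, rather than sleeping and retrying.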

How spend is calculated

Each model has an input-token cost and an output-token cost; the published per-model rates are returned in the cost object on GET /v1/models. Streaming and non-streaming responses are priced identically. The final usage block on a stream is the source of truth.

Embeddings are priced on input tokens only (completion_tokens is always 0).
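As a rough illustration of the arithmetic, the sketch below prices one request from its token counts. It assumes rates are quoted in USD per one million tokens; this page does not state the unit or the field names inside the cost object, so the rates are passed in as plain arguments and the per parameter can be changed to match the published unit.

```python
from decimal import Decimal


def request_cost_usd(prompt_tokens, completion_tokens, input_rate, output_rate,
                     per=1_000_000):
    """Estimate one request's cost in USD.

    input_rate and output_rate are USD per `per` tokens (assumed unit).
    Decimal avoids float rounding drift when many small costs are summed.
    """
    input_cost = Decimal(prompt_tokens) * Decimal(str(input_rate))
    output_cost = Decimal(completion_tokens) * Decimal(str(output_rate))
    return (input_cost + output_cost) / Decimal(per)


# Chat-style request: both input and output tokens are charged.
chat_cost = request_cost_usd(1000, 500, "0.50", "1.50")

# Embedding request: completion_tokens is 0, so only the input rate matters.
embedding_cost = request_cost_usd(2000, 0, "0.10", "0")
```

The rates used here are made-up numbers for the example, not published prices.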

Spend is updated only after a request completes, so it cannot be observed mid-flight. If you need finer-grained tracking, instrument your client.
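One way to instrument a client is to keep a local running total alongside the server's figure. The sketch below is an assumption-laden example, not a PrivateMind API: SpendTracker is a hypothetical class, it expects the caller to supply each request's cost in USD, and it mirrors the documented monthly UTC reset. The server's accounting remains authoritative.

```python
from datetime import datetime, timezone
from decimal import Decimal


class SpendTracker:
    """Client-side estimate of a key's spend for the current UTC month."""

    def __init__(self, budget_usd, now=None):
        self.budget_usd = Decimal(str(budget_usd))
        self.spent_usd = Decimal("0")
        # `now` is injectable so the monthly reset can be tested.
        self._now = now or (lambda: datetime.now(timezone.utc))
        self._month = self._current_month()

    def _current_month(self):
        t = self._now()
        return (t.year, t.month)

    def record(self, cost_usd):
        """Add one completed request's cost and return the remaining budget."""
        month = self._current_month()
        if month != self._month:
            # New calendar month (UTC): start the local total from zero.
            self._month = month
            self.spent_usd = Decimal("0")
        self.spent_usd += Decimal(str(cost_usd))
        return self.remaining_usd

    @property
    def remaining_usd(self):
        return max(Decimal("0"), self.budget_usd - self.spent_usd)

    @property
    def exhausted(self):
        # Same comparison the docs describe: spent_usd >= budget_usd.
        return self.spent_usd >= self.budget_usd
```

Calling record() after every response, and checking exhausted before sending the next request, gives an early warning before the server starts returning 402.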

Rate limits

Each key has a sliding-window requests-per-minute limit. Requests that exceed it return:

Text

HTTP/1.1 429 Too Many Requests

JSON

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "rpm_exceeded"
  }
}

The rate is per-key, not per-org and not per-IP. Splitting traffic across multiple keys multiplies your effective ceiling.
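To stay under the limit rather than react to 429s, a client can throttle itself with its own sliding window. The sketch below is a client-side approximation only: the server's exact window accounting is not documented here, SlidingWindowLimiter is a hypothetical name, and max_calls should be set to the key's actual limit. Since the limit is per key, use one limiter instance per key.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any trailing window_s seconds."""

    def __init__(self, max_calls, window_s=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock
        self._sent = deque()  # timestamps of calls still inside the window

    def wait_time(self):
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        # The oldest call must leave the window before a slot frees up.
        return self.window_s - (now - self._sent[0])

    def try_acquire(self):
        """Record a call and return True if under the limit, else False."""
        if self.wait_time() > 0:
            return False
        self._sent.append(self._clock())
        return True
```

A sender would call try_acquire() before each request and sleep for wait_time() when it returns False. Leaving some headroom below the real limit helps absorb clock differences between client and server.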

Backoff

Python

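import random
import time

import openai


def call_with_retry(fn, max_attempts=5):
    """Call fn(), retrying on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except openai.RateLimitError:
            # Out of attempts: surface the 429 to the caller.
            if attempt == max_attempts - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus up to 1s of jitter, capped at 30s.
            delay = min(30, (2 ** attempt) + random.random())
            time.sleep(delay)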

Designing around the limits

One key per workload. Don't share a key across unrelated services. One misbehaving service burns the others' budget.
Tight budgets in dev. A runaway loop in a notebook can spend a month's quota in an hour. Use a separate dev key with a low cap.
Watch token usage in responses. Every response includes usage; surface it in logs.
Cache where you can. Identical embedding queries can be cached client-side.
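The caching advice above can be sketched as a small in-memory wrapper. This is an illustrative example under stated assumptions: EmbeddingCache is a hypothetical class, and embed_fn stands in for whatever function your client uses to request an embedding, taking (model, text) and returning a vector.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache so identical (model, text) pairs are embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # caller-supplied: (model, text) -> vector
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, model, text):
        # Key on both model and text: the same text embedded with a
        # different model yields a different vector.
        key = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed_fn(model, text)
        self._store[key] = vector
        return vector
```

Each hit avoids both a request against the rate limit and the input-token cost of the embedding. For long-running services, a bounded or persistent store would be a reasonable replacement for the plain dict.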

Where to view spend

Per-key spend, budget remaining, and recent activity are shown in Settings → API Keys alongside the key list. The same view lets you rotate, revoke, and raise caps without involving an admin.

Where next

Errors for the full envelope around 402 and 429.
Authentication for key shape and rotation.
Usage for the API endpoint that surfaces spend programmatically.