Streaming · PrivateMind Docs

Set "stream": true on a chat-completions request and the API responds with Server-Sent Events instead of a single JSON body.

cURL

curl -N "https://api.privatemind.com/v1/chat/completions" \
  -H "Authorization: Bearer $PMIND_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fast",
    "messages": [{"role": "user", "content": "Write a haiku about TCP."}],
    "stream": true
  }'

curl -N disables curl's output buffering so chunks print as they arrive.

Wire format

Each event is a single data: <json> line followed by a blank line:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Packets"},"index":0}]}

...

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":28,"total_tokens":40}}

data: [DONE]

choices[].delta.content is the incremental token text. Concatenate the values in arrival order to build the full response.
The last chunk before [DONE] carries usage. PrivateMind always requests stream_options.include_usage, so you get token counts at the end of the stream without a second call.
data: [DONE] terminates the stream. It is not JSON, so treat it as a sentinel and do not pass it to a JSON parser.
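
For clients that read the stream by hand, the three rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference client: collect_stream is a hypothetical helper name, it assumes a single choice per chunk, and the " drift" chunk in the sample is invented to show concatenation.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Return (full_text, usage) from an iterable of raw SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank separators and anything else that is not a data: line are skipped.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel, not JSON: stop before json.loads sees it
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
        # Only the last chunk before [DONE] is expected to carry usage.
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage


# Sample lines modelled on the wire format above (second content chunk is made up).
sample = [
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"role":"assistant"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Packets"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" drift"},"index":0}]}',
    "",
    'data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{},"index":0,"finish_reason":"stop"}],"usage":{"prompt_tokens":12,"completion_tokens":28,"total_tokens":40}}',
    "",
    "data: [DONE]",
    "",
]

text, usage = collect_stream(sample)
print(text)
print(usage)
```

In a real client the lines would come from the HTTP response body rather than a list; the parsing logic stays the same.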

Keepalive pings

An idle stream is not a dead one. While the model is silent (slow time-to-first-token, a long tool execution, heavy reasoning) the API emits an SSE comment frame every 10 seconds of upstream silence:

: ping

Comment frames carry no data and never appear inside an event. Spec-compliant SSE parsers (EventSource, the OpenAI SDKs, Vercel's AI SDK) discard them automatically, so most clients need no change. If you parse the stream by hand, ignore any line starting with :.

Two things to expect:

A stream can open with a ping. If the model has not produced its first token yet, the first bytes on the wire may be : ping rather than a data: line.
Pings defeat idle timeouts, not buffering. They exist so proxies and load balancers between you and the API do not kill a connection that looks idle. A buffering proxy still breaks streaming (see the warning below).
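
Because a ping arrives every 10 seconds of silence, a hand-rolled client could treat a much longer gap with no bytes at all as a dead connection. The sketch below illustrates that idea under stated assumptions: StreamWatchdog is a hypothetical helper, and the 30-second threshold (three ping intervals) is an arbitrary choice, not a value the API documents.

```python
import time
from typing import Callable, Optional

PING_INTERVAL_S = 10.0               # keepalive cadence stated above
STALL_AFTER_S = 3 * PING_INTERVAL_S  # assumption: tolerate two missed pings


class StreamWatchdog:
    """Tracks when the last line arrived and filters out comment frames."""

    def __init__(self, clock: Callable[[], float] = time.monotonic,
                 stall_after: float = STALL_AFTER_S) -> None:
        self._clock = clock
        self._stall_after = stall_after
        self._last_seen = clock()

    def feed(self, line: str) -> Optional[str]:
        """Record activity for any line; return the payload of data: lines only."""
        self._last_seen = self._clock()
        if line.startswith(":"):
            return None  # comment frame such as ": ping", carries no data
        if line.startswith("data:"):
            return line[len("data:"):].strip()
        return None  # blank separators and any other SSE fields

    def stalled(self) -> bool:
        """True once nothing, not even a ping, has arrived for too long."""
        return self._clock() - self._last_seen > self._stall_after


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
dog = StreamWatchdog(clock=lambda: now[0])
now[0] = 10.0
first = dog.feed(": ping")          # a stream may open with a ping
now[0] = 12.0
second = dog.feed('data: {"choices":[]}')
```

In practice a socket read timeout set comfortably above the ping interval achieves the same effect with less code; the class only makes the reasoning explicit.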

SDK examples

Python

from openai import OpenAI

# Point the OpenAI SDK at the PrivateMind endpoint.
client = OpenAI(base_url="https://api.privatemind.com/v1", api_key="PMIND...:...")

stream = client.chat.completions.create(
    model="<model-id>",
    messages=[{"role": "user", "content": "Write a haiku about TCP."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no text, such as the role-only and final chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Reasoning output

Reasoning models stream their chain-of-thought in a separate field, delta.reasoning, and keep delta.content for the clean final answer, with no inline <think>…</think> wrapper. Non-streaming responses carry the same text in message.reasoning.

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"reasoning":"Let me work through"},"index":0}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"reasoning":" the proof step by step…"},"index":0}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"The answer is 42."},"index":0}]}

Render or collapse delta.reasoning as you like; concatenate delta.content for the answer. To feature-detect, check whether reasoning appears in supported_parameters. Turn thinking on or off per request with reasoning_effort.
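
One way to keep the two fields apart is to accumulate them into separate buffers. The sketch below works on already-extracted data: payloads (the JSON after "data: "); split_reasoning is a hypothetical helper name, and the payloads are the three sample chunks shown above.

```python
import json
from typing import Iterable, Tuple


def split_reasoning(payloads: Iterable[str]) -> Tuple[str, str]:
    """Return (reasoning_text, answer_text) from data: payload strings."""
    reasoning = []
    answer = []
    for payload in payloads:
        if payload == "[DONE]":
            break  # sentinel, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # A chunk may carry reasoning, content, or neither.
            if delta.get("reasoning"):
                reasoning.append(delta["reasoning"])
            if delta.get("content"):
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)


payloads = [
    '{"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"reasoning":"Let me work through"},"index":0}]}',
    '{"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"reasoning":" the proof step by step…"},"index":0}]}',
    '{"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"The answer is 42."},"index":0}]}',
    "[DONE]",
]

thoughts, final_answer = split_reasoning(payloads)
print("reasoning:", thoughts)
print("answer:", final_answer)
```

A UI might show thoughts in a collapsible panel and stream final_answer into the main message area.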

When to use streaming

Interactive UIs. Show output as it arrives.
Long generations. Render progress instead of a spinner.
Cost capture. The final chunk's usage block reports exact token counts, so you can record spend as soon as the stream ends.

For batch jobs or short responses, a non-streaming request is simpler.

Where next

Chat completions for the non-streaming variant and the full parameter list.
Tool use for how tool calls surface inside a stream.
Errors for how errors appear mid-stream.