Get API key

Embeddings

Dense vector representations for retrieval, classification, and clustering.

POST /v1/embeddings returns dense vector representations of input text. The shape matches OpenAI's.

curl -s "https://api.privatemind.com/v1/embeddings" \
  -H "Authorization: Bearer $PMIND_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kalm-embedding-gemma3-12b-2511",
    "input": ["the quick brown fox", "jumps over the lazy dog"]
  }' | jq '.data[].embedding | length'
from openai import OpenAI

client = OpenAI(base_url="https://api.privatemind.com/v1", api_key=PMIND_KEY)
resp = client.embeddings.create(
    model="kalm-embedding-gemma3-12b-2511",
    input=["the quick brown fox", "jumps over the lazy dog"],
)
print(len(resp.data[0].embedding))
import OpenAI from 'openai';

const client = new OpenAI({ baseURL: `https://api.privatemind.com/v1`, apiKey: PMIND_KEY });
const resp = await client.embeddings.create({
  model: 'kalm-embedding-gemma3-12b-2511',
  input: ['the quick brown fox', 'jumps over the lazy dog'],
});
console.log(resp.data[0].embedding.length);

Parameters

  • model (required): id of an embedding model. In GET /v1/models, embedding models report an empty supported_parameters array; identify them by id or modelType.
  • input (required): a single string or an array of strings. Batch processing is more efficient than one call per string.
  • encoding_format (default float): float or base64. Base64 is ~4× smaller on the wire, useful for large batches.
  • dimensions: truncate the output vector to N dimensions, where the model supports it. Lower dimensions trade quality for storage.

Response

JSON
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.087, ...] },
    { "object": "embedding", "index": 1, "embedding": [0.034, 0.121, ...] }
  ],
  "model": "<embedding-model-id>",
  "usage": { "prompt_tokens": 14, "total_tokens": 14 }
}

completion_tokens is always zero for embeddings. Billing is on input tokens only.

Batching

Send up to several hundred strings per request. Batching is much faster than one-call-per-string — often 10× or more. because tokenisation and dispatch overhead dominate at small batch sizes.

Where next

  • Models to identify embedding models in your org's catalogue.
  • Sources for the built-in ingestion + retrieval pipeline.
  • Rate limits for budget and RPM behaviour on embedding calls.