Chat Completions

OpenAI-compatible chat completions endpoint. Drop-in compatible with the official OpenAI SDKs (Python, JS, etc.) — point them at Ragen as the base_url and they'll work unchanged.

POST https://api.ragen.ai/v1/chat/completions

Authentication

Authorization: Bearer YOUR_API_KEY

Each API key is scoped to a specific project — its knowledge base is what the model will retrieve against. Create and manage keys under Settings → API Keys in the dashboard.
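When not going through an SDK, the header can be built by hand. A minimal sketch (the `auth_headers` helper is hypothetical, not part of any SDK):

```python
def auth_headers(api_key: str) -> dict:
    """Headers every request to /v1/chat/completions needs:
    the Bearer token plus a JSON content type."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```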

Request

Body

| Field | Type | Required | Description |
|---|---|---|---|
| `messages` | array | Yes | 1–100 messages in conversation order. See Message format. |
| `model` | string | No | Model id to use (e.g. `gpt-5.4`). Defaults to the organization's default model. List available models via `GET /v1/models`. |
| `temperature` | number | No | 0–2. Overrides the organization default. |
| `max_tokens` | integer | No | Cap on generated tokens (1–32,000). |
| `stream` | boolean | No | When `true`, responds with a Server-Sent Events stream. Default `false`. |
| `stream_options` | object | No | Streaming options. Currently supports `include_usage: boolean`. See Streaming. |
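The constraints above can be checked client-side before sending. A sketch under the documented limits (the `validate_request` helper is hypothetical; the server performs its own validation regardless):

```python
def validate_request(body: dict) -> list[str]:
    """Check a request body against the documented constraints;
    returns a list of error strings (empty when valid)."""
    errors = []
    msgs = body.get("messages")
    if not isinstance(msgs, list) or not 1 <= len(msgs) <= 100:
        errors.append("messages must contain 1-100 items")
    temperature = body.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        errors.append("temperature must be within 0-2")
    max_tokens = body.get("max_tokens")
    if max_tokens is not None and not 1 <= max_tokens <= 32_000:
        errors.append("max_tokens must be within 1-32000")
    return errors
```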

Message format

Each message is { role, content }:

| Role | Purpose |
|---|---|
| `user` | The user's turn. The last user message is treated as the active question; earlier ones become conversation history. |
| `assistant` | Prior assistant responses. Included in conversation history. |
| `system` | Per-request system instructions — merged on top of the project's own instructions (project owner's intent first, then the caller's). |
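Putting the three roles together (content strings are illustrative):

```python
messages = [
    {"role": "system", "content": "Respond concisely."},          # caller's per-request instructions
    {"role": "user", "content": "What is our refund policy?"},    # earlier turn -> history
    {"role": "assistant", "content": "Refunds within 30 days."},  # prior answer -> history
    {"role": "user", "content": "What about digital products?"},  # last user message -> active question
]
```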

Response

Non-streaming

Returns a chat.completion object:

{
  "id": "chatcmpl-7a2b4c...",
  "object": "chat.completion",
  "created": 1744664400,
  "model": "gpt-5.4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 42,
    "total_tokens": 170
  }
}

Streaming

When stream: true, returns text/event-stream with a sequence of chat.completion.chunk objects, terminated by data: [DONE]:

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"content":"Our "},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"content":"refund policy..."},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{},"finish_reason":"stop","logprobs":null}]}

data: [DONE]

  • The first chunk carries delta.role: "assistant" (OpenAI convention).
  • Content chunks carry delta.content.
  • The final chunk before [DONE] has empty delta and finish_reason: "stop".
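A minimal sketch of decoding this framing by hand (real clients should prefer the SDKs' built-in streaming; `parse_sse_line` is a hypothetical helper):

```python
import json

def parse_sse_line(line: str):
    """Decode one line of the event stream. Returns a chunk dict,
    the sentinel string "DONE" for the terminator, or None for
    blank/non-data lines."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return "DONE"
    return json.loads(payload)
```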

Including usage in streams

By OpenAI convention, usage is not emitted on streaming responses unless the caller opts in. Pass stream_options: { include_usage: true } to receive a trailing usage chunk:

{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "created": 1744664400,
  "model": "gpt-5.4",
  "choices": [],
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 42,
    "total_tokens": 170
  }
}

The choices: [] signals that this chunk carries usage only, not content.
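A consumer can key off that shape. A sketch (hypothetical helper):

```python
def is_usage_chunk(chunk: dict) -> bool:
    """The trailing usage chunk has empty choices and a usage object."""
    return chunk.get("choices") == [] and chunk.get("usage") is not None
```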

Examples

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ragen.ai/v1",
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "user", "content": "What is our refund policy?"},
    ],
)
print(resp.choices[0].message.content)

Python streaming

stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Summarize our onboarding process"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n\nUsed {chunk.usage.total_tokens} tokens")

JavaScript / TypeScript (openai-node)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.ragen.ai/v1",
  apiKey: process.env.RAGEN_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "gpt-5.4",
  messages: [{ role: "user", content: "What is our refund policy?" }],
});
console.log(resp.choices[0].message.content);

cURL

curl -X POST https://api.ragen.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4",
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ]
  }'

Multi-turn conversation

The last user message is treated as the active question; earlier messages form the conversation context.

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "user", "content": "What is our refund policy?"},
        {"role": "assistant", "content": "We offer refunds within 30 days of purchase."},
        {"role": "user", "content": "What about digital products?"},
    ],
)

Per-request system prompt

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "Respond only in Polish."},
        {"role": "user", "content": "What is our refund policy?"},
    ],
)

System messages are merged on top of the project's own instructions — the project owner's instructions come first, then the caller's.

Error responses

Errors use the OpenAI error envelope so existing SDK error-handling keeps working:

{
  "error": {
    "message": "prompt too long",
    "type": "invalid_request_error",
    "code": "context_length_exceeded",
    "param": null
  }
}
| Status | `type` | Meaning |
|---|---|---|
| 400 | `invalid_request_error` | Malformed body, validation failure, prompt too long |
| 401 | `authentication_error` | Missing or invalid API key |
| 403 | `permission_error` | Key deactivated or missing required scope |
| 404 | `not_found_error` | Project (assistant) not found |
| 429 | `rate_limit_error` | Rate limit exceeded |
| 5xx | `api_error` | Upstream / internal error |
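Only some of these statuses are worth retrying. A sketch of that decision, following the table above (`should_retry` is a hypothetical helper, not part of any SDK):

```python
def should_retry(status: int) -> bool:
    """Retry 429 (rate limit) and 5xx (upstream/internal) responses;
    the 4xx request, auth, permission, and not-found errors won't
    succeed on a blind retry."""
    return status == 429 or status >= 500
```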

Rate limits

Chat completions run a full RAG pipeline (vector search + rerank + LLM call), so they're on the expensive tier:

| Scope | Limit |
|---|---|
| `POST /v1/chat/completions` | 10 req/min per IP |

Streaming and non-streaming requests count identically against this budget. When the limit is hit, the response is a 429 rate_limit_error; back off with jitter.
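"Back off with jitter" can be as simple as full-jitter exponential backoff. A sketch (`backoff_delay` is a hypothetical helper; the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: sleep a random duration drawn uniformly from
    [0, min(cap, base * 2**attempt)] seconds before retrying."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```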

Differences vs. /v1/chat

The ragen-native POST /v1/chat endpoint is a simpler interface (single content string + optional context) that predates the OpenAI-compatible one. Both are supported:

| | `/v1/chat` | `/v1/chat/completions` |
|---|---|---|
| Format | Ragen-native JSON / SSE | OpenAI wire format |
| Multi-turn | No (single prompt) | Yes (`messages` array) |
| Model override | No | Yes (`model` field) |
| Temperature override | No | Yes |
| `max_tokens` | No | Yes |
| Page context injection | Yes (`context` field) | No — use `messages` instead |
| Usage in streams | No | Opt-in via `stream_options` |

Use /v1/chat/completions for any new integration that benefits from the OpenAI SDK ecosystem. Keep /v1/chat for the embed widget and other existing ragen-native consumers.