# Chat Completions

OpenAI-compatible chat completions endpoint. Drop-in compatible with the
official OpenAI SDKs (Python, JS, etc.) — point them at Ragen as the
`base_url` and they'll work unchanged.

```
POST https://api.ragen.ai/v1/chat/completions
```
## Authentication

```
Authorization: Bearer YOUR_API_KEY
```
Each API key is scoped to a specific project — its knowledge base is what the model will retrieve against. Create and manage keys under Settings → API Keys in the dashboard.
## Request

### Body

| Field | Type | Required | Description |
|---|---|---|---|
| `messages` | array | Yes | 1–100 messages in conversation order. See Message format. |
| `model` | string | No | Model id to use (e.g. `gpt-5.4`). Defaults to the organization's default model. List available models via `GET /v1/models`. |
| `temperature` | number | No | 0–2. Overrides the organization default. |
| `max_tokens` | integer | No | Cap on generated tokens (1–32,000). |
| `stream` | boolean | No | When `true`, responds with a Server-Sent Events stream. Default `false`. |
| `stream_options` | object | No | Streaming options. Currently supports `include_usage: boolean`. See Streaming. |
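These bounds can be checked client-side before a request goes out, which turns a guaranteed `400` into an immediate local error. A minimal sketch; the `validate_body` helper and its return shape are our own, not part of the API:

```python
def validate_body(body: dict) -> list[str]:
    """Return a list of validation problems (empty list means the body
    passes the documented limits)."""
    problems = []
    msgs = body.get("messages")
    if not isinstance(msgs, list) or not 1 <= len(msgs) <= 100:
        problems.append("messages must contain 1-100 entries")
    t = body.get("temperature")
    if t is not None and not 0 <= t <= 2:
        problems.append("temperature must be between 0 and 2")
    mt = body.get("max_tokens")
    if mt is not None and not 1 <= mt <= 32_000:
        problems.append("max_tokens must be between 1 and 32000")
    return problems
```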
### Message format

Each message is `{ role, content }`:

| Role | Purpose |
|---|---|
| `user` | The user's turn. The last user message is treated as the active question; earlier ones become conversation history. |
| `assistant` | Prior assistant responses. Included in conversation history. |
| `system` | Per-request system instructions — merged on top of the project's own instructions (project owner's intent first, then the caller's). |
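In practice the array is just built in order: optional system instructions first, prior turns in the middle, and the active question last. A small illustrative helper; `build_messages` is our own, not part of any SDK:

```python
def build_messages(history, question, system=None):
    """Assemble a messages array: optional per-request system
    instructions, then prior turns, then the active question last."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.extend(history)  # alternating user/assistant turns
    messages.append({"role": "user", "content": question})
    return messages
```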
## Response

### Non-streaming

Returns a `chat.completion` object:

```json
{
  "id": "chatcmpl-7a2b4c...",
  "object": "chat.completion",
  "created": 1744664400,
  "model": "gpt-5.4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Our refund policy allows returns within 30 days..."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 42,
    "total_tokens": 170
  }
}
```
### Streaming

When `stream: true`, returns `text/event-stream` with a sequence of
`chat.completion.chunk` objects, terminated by `data: [DONE]`:

```
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"content":"Our "},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{"content":"refund policy..."},"finish_reason":null,"logprobs":null}]}

data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":1744664400,"model":"gpt-5.4","choices":[{"index":0,"delta":{},"finish_reason":"stop","logprobs":null}]}

data: [DONE]
```
- The first chunk carries `delta.role: "assistant"` (OpenAI convention).
- Content chunks carry `delta.content`.
- The final chunk before `[DONE]` has an empty `delta` and `finish_reason: "stop"`.
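If you consume the stream without an SDK, the chunks can be folded back into a single reply by reading the `data:` lines directly. A rough sketch, assuming the event-stream lines are already decoded to strings; `consume_sse` is our own helper:

```python
import json

def consume_sse(lines):
    """Collect assistant text (and the trailing usage object, if any)
    from an iterable of SSE 'data:' lines."""
    text, usage = [], None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage") or usage
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text.append(delta["content"])
    return "".join(text), usage
```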
#### Including usage in streams

By OpenAI convention, usage is not emitted on streaming responses
unless the caller opts in. Pass `stream_options: { "include_usage": true }`
to receive a trailing usage chunk:
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "created": 1744664400,
  "model": "gpt-5.4",
  "choices": [],
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 42,
    "total_tokens": 170
  }
}
```
The empty `choices` array signals that this chunk carries usage only, not content.
## Examples

### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ragen.ai/v1",
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "user", "content": "What is our refund policy?"},
    ],
)
print(resp.choices[0].message.content)
```
### Python streaming

```python
stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Summarize our onboarding process"}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\n\nUsed {chunk.usage.total_tokens} tokens")
```
### JavaScript / TypeScript (openai-node)

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.ragen.ai/v1",
  apiKey: process.env.RAGEN_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "gpt-5.4",
  messages: [{ role: "user", content: "What is our refund policy?" }],
});
console.log(resp.choices[0].message.content);
```
### cURL

```bash
curl -X POST https://api.ragen.ai/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4",
    "messages": [
      {"role": "user", "content": "What is our refund policy?"}
    ]
  }'
```
### Multi-turn conversation

The last user message is treated as the active question; earlier
messages form the conversation context.

```python
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "user", "content": "What is our refund policy?"},
        {"role": "assistant", "content": "We offer refunds within 30 days of purchase."},
        {"role": "user", "content": "What about digital products?"},
    ],
)
```
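Keeping the history in a list and appending both sides of each turn makes follow-up questions straightforward. A sketch; the `ask` helper is our own, not part of the SDK:

```python
def ask(client, history, question, model="gpt-5.4"):
    """Send one turn and record both sides in history so the next
    question keeps its conversational context."""
    history.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model=model, messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```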
### Per-request system prompt

```python
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "Respond only in Polish."},
        {"role": "user", "content": "What is our refund policy?"},
    ],
)
```
System messages are merged on top of the project's own instructions — the project owner's instructions come first, then the caller's.
## Error responses

Errors use the OpenAI error envelope so existing SDK error handling keeps working:

```json
{
  "error": {
    "message": "prompt too long",
    "type": "invalid_request_error",
    "code": "context_length_exceeded",
    "param": null
  }
}
```
| Status | `type` | Meaning |
|---|---|---|
| 400 | `invalid_request_error` | Malformed body, validation failure, prompt too long |
| 401 | `authentication_error` | Missing or invalid API key |
| 403 | `permission_error` | Key deactivated or missing required scope |
| 404 | `not_found_error` | Project (assistant) not found |
| 429 | `rate_limit_error` | Rate limit exceeded |
| 5xx | `api_error` | Upstream / internal error |
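One way to drive retry logic off this envelope is to parse it and branch on status: 429 and 5xx are transient and worth retrying, while other 4xx errors indicate a problem with the request itself. A sketch; the function and its return shape are ours, not part of the API:

```python
import json

def classify_error(status, body):
    """Parse the OpenAI-style error envelope and decide retryability:
    429 and 5xx are transient, other 4xx indicate a caller-side bug."""
    err = json.loads(body).get("error", {})
    return {
        "type": err.get("type"),
        "code": err.get("code"),
        "message": err.get("message"),
        "retry": status == 429 or status >= 500,
    }
```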
## Rate limits

Chat completions run a full RAG pipeline (vector search + rerank + LLM call), so they're on the expensive tier:

| Scope | Limit |
|---|---|
| `POST /v1/chat/completions` | 10 req/min per IP |
Streaming and non-streaming requests count identically against this
budget. When the limit is hit, the response is `429 rate_limit_error`;
back off with jitter.
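"Back off with jitter" usually means exponential backoff with a randomized delay, so many clients hitting the limit at once don't retry in lockstep. A minimal full-jitter sketch; the base and cap values are illustrative, not prescribed by the API:

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt) seconds for the given retry attempt."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```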
## Differences vs. /v1/chat

The Ragen-native `POST /v1/chat` endpoint is a simpler
interface (single `content` string + optional `context`) that predates
the OpenAI-compatible one. Both are supported:

| | `/v1/chat` | `/v1/chat/completions` |
|---|---|---|
| Format | Ragen-native JSON / SSE | OpenAI wire format |
| Multi-turn | No (single prompt) | Yes (`messages` array) |
| Model override | No | Yes (`model` field) |
| Temperature override | No | Yes |
| `max_tokens` | No | Yes |
| Page context injection | Yes (`context` field) | No — use `messages` instead |
| Usage in streams | No | Opt-in via `stream_options` |
Use `/v1/chat/completions` for any new integration that benefits from
the OpenAI SDK ecosystem. Keep `/v1/chat` for the embed widget and
other existing Ragen-native consumers.