Chat sessions (mobile / app contract)
Persistent conversations with auto-title, SSE streaming, and stop-on-demand. The contract used by the iOS/Android apps.
Last updated: 2026-05-24
Chat sessions API
The developer API at api.siati.ai (Bearer key, stateless /v1/chat/completions) is the OpenAI-compatible entry point. For mobile apps and persistent chat UIs, we expose a second, session-oriented API at my.siati.ai/api/v1/chat/sessions/* with:
- Stateful conversations stored server-side (no need to ship history every request)
- Server-Sent Events streaming with chunk-level delivery
- Auto-titled conversations from the first user message
- Per-message metrics (TTFT, latency, tok/s, prompt/completion tokens)
- Stop-generation endpoint for interrupting long responses
Authentication is JWT Bearer (issued at app login), not API key.
Base URL & auth
Base URL: https://my.siati.ai/api/v1
Auth: Authorization: Bearer <jwt>
Obtain the JWT from POST /api/v1/auth/login with email and password; the returned token is valid for 30 days. The token is stateless, and the app keeps it in secure storage (Keychain on iOS, Keystore on Android).
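A minimal sketch of the login request and of the header set used afterwards. The body field names `email` and `password` and the `RequestSpec` type are assumptions made for illustration; the shape of the login response is not documented here, so the sketch stops at building the request.

```typescript
// Hypothetical request descriptor; pass its fields to fetch() or any HTTP client.
interface RequestSpec {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

const BASE = "https://my.siati.ai/api/v1";

// Builds the login request. The field names `email` and `password` are assumed.
function loginRequest(email: string, password: string): RequestSpec {
  return {
    url: `${BASE}/auth/login`,
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    body: JSON.stringify({ email, password }),
  };
}

// Header set for every authenticated call to the sessions API.
function authHeaders(jwt: string): Record<string, string> {
  return { Authorization: `Bearer ${jwt}`, Accept: "application/json" };
}
```

Keeping request construction separate from the HTTP call makes it easy to unit-test without a network.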
Endpoints
List conversations
GET /chat/sessions
Returns the user's last 200 non-archived conversations (no messages, just metadata).
[
{
"id": "019e4c12-3a4b-...",
"title": "Teorema di Pitagora",
"model": "apertus-70b-instruct",
"created_at": "2026-05-24T08:12:00+00:00",
"updated_at": "2026-05-24T08:14:32+00:00",
"messages": []
}
]
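The list entry can be typed directly from the example above. The ordering of the returned list is not stated, so this sketch sorts client-side by `updated_at`; the helper name `sortByRecent` is hypothetical.

```typescript
// Shape of one entry in the GET /chat/sessions response, per the example above.
interface ConversationSummary {
  id: string;
  title: string;
  model: string;
  created_at: string; // ISO 8601 timestamp
  updated_at: string; // ISO 8601 timestamp
  messages: unknown[]; // empty in the list response
}

// Returns a new array sorted by most recently updated first.
function sortByRecent(list: ConversationSummary[]): ConversationSummary[] {
  return [...list].sort(
    (a, b) => Date.parse(b.updated_at) - Date.parse(a.updated_at),
  );
}
```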
Get conversation with messages
GET /chat/sessions/{id}
Returns the full conversation including all messages.
{
"id": "019e4c12-3a4b-...",
"title": "Teorema di Pitagora",
"model": "apertus-70b-instruct",
"created_at": "...",
"updated_at": "...",
"messages": [
{
"id": "...",
"role": "user",
"content": "Spiegami il teorema di Pitagora",
"created_at": "...",
"prompt_tokens": 0,
"completion_tokens": 0,
"latency_ms": 0
},
{
"id": "...",
"role": "assistant",
"content": "Il teorema di Pitagora afferma…",
"created_at": "...",
"prompt_tokens": 24,
"completion_tokens": 187,
"latency_ms": 0
}
]
}
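As an illustration of the per-message metrics, the sketch below sums token counts over a conversation and derives a tokens-per-second figure. The formula (completion tokens divided by total latency) is an assumption; the server-side tok/s metric may be computed differently, for example excluding time to first token.

```typescript
// Message shape from the example above.
interface ChatMessage {
  id: string;
  role: string; // "user" or "assistant" in the example
  content: string;
  created_at: string;
  prompt_tokens: number;
  completion_tokens: number;
  latency_ms: number;
}

// Sums token counts across all messages of a conversation.
function tokenTotals(messages: ChatMessage[]): { prompt: number; completion: number } {
  return messages.reduce(
    (acc, m) => ({
      prompt: acc.prompt + m.prompt_tokens,
      completion: acc.completion + m.completion_tokens,
    }),
    { prompt: 0, completion: 0 },
  );
}

// End-to-end generation rate in tokens per second, or null when latency_ms is 0
// (as in the stored messages of the example above).
function tokensPerSecond(completionTokens: number, latencyMs: number): number | null {
  return latencyMs > 0 ? completionTokens / (latencyMs / 1000) : null;
}
```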
Create conversation
POST /chat/sessions
Content-Type: application/json
{
"title": "My new chat", // optional, defaults to "Nuova conversazione"
"model": "apertus-70b-instruct" // optional, defaults to qwen2.5:1.5b
}
Returns the new conversation (201 Created). Use the returned id for subsequent messages.
Rename / delete
PATCH /chat/sessions/{id}
Content-Type: application/json
{
"title": "Renamed conversation"
}
DELETE /chat/sessions/{id}
DELETE is a hard delete — conversation and all messages are removed (FK cascade). Use the dashboard "Archivia" if you want soft-archive instead.
Send a message (the main one)
POST /chat/sessions/{id}/messages
Content-Type: application/json
Accept: text/event-stream
{
"content": "Spiegami il teorema di Pitagora",
"model": "apertus-70b-instruct",
"tier": "fast", // optional: slow|medium|fast|ludicrous
"backend": "spark" // optional: family name (mac-mini, l40-b, spark, bigguy, inference-vm)
}
Response: text/event-stream (SSE). Chunks arrive in order:
data: {"type":"delta","text":"Il teorema "}
data: {"type":"delta","text":"di Pitagora "}
data: {"type":"delta","text":"afferma che "}
...
data: {"type":"done","prompt_tokens":24,"completion_tokens":187,"latency_ms":4823,"outcome":"ok","title_job_queued":true}
Event types:
| type | Fields | Meaning |
|---|---|---|
| delta | text | Append text to the assistant message buffer |
| done | prompt_tokens, completion_tokens, latency_ms, outcome, title_job_queued | Stream complete. Save the accumulated assistant message. |
| error | code, message | Upstream/inference error. outcome will be error on the next done. |
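The three event types map naturally onto a discriminated union. The sketch below parses complete `data: {...}` frames out of a text buffer and keeps the unfinished tail for the next network chunk. The function names are hypothetical, and the type of `code` is not documented, so it is left as `unknown`.

```typescript
// Event shapes from the table above.
type ChatEvent =
  | { type: "delta"; text: string }
  | {
      type: "done";
      prompt_tokens: number;
      completion_tokens: number;
      latency_ms: number;
      outcome: string;
      title_job_queued: boolean;
    }
  | { type: "error"; code: unknown; message: string };

// Splits a buffer into complete SSE frames ("data: {...}\n\n") and returns the
// parsed events plus the unconsumed tail, which must be kept for the next chunk.
function parseSseBuffer(buffer: string): { events: ChatEvent[]; rest: string } {
  const events: ChatEvent[] = [];
  let rest = buffer;
  let end: number;
  while ((end = rest.indexOf("\n\n")) !== -1) {
    const frame = rest.slice(0, end);
    rest = rest.slice(end + 2);
    if (frame.startsWith("data: ")) {
      events.push(JSON.parse(frame.slice(6)) as ChatEvent);
    }
  }
  return { events, rest };
}

// Appends the text of every delta event to the accumulated assistant text.
function applyEvents(text: string, events: ChatEvent[]): string {
  for (const ev of events) {
    if (ev.type === "delta") text += ev.text;
  }
  return text;
}
```

Returning the tail explicitly matters because a network chunk can end in the middle of a frame; parsing only complete frames avoids `JSON.parse` failures on partial data.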
About title_job_queued
When the response is for the first user message of a conversation, the server queues a background job (GenerateConversationTitle) that uses Qwen 2.5 7B to write a concise title. The job typically completes in 1-2 seconds after the done event.
The flag title_job_queued: true in the done event tells you to poll GET /chat/sessions/{id} once after ~2 seconds to pick up the new title. After that, the title is stable.
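The single delayed poll can be sketched as a small async helper. `getTitle` and `sleep` are hypothetical injected functions (for example a wrapper around GET /chat/sessions/{id} and a `setTimeout`-based delay), so the sketch stays independent of any HTTP client.

```typescript
// Polls once for the auto-generated title after a fixed delay.
// Returns null when no title job was queued, i.e. the title is already stable.
async function refreshTitleOnce(
  titleJobQueued: boolean,
  getTitle: () => Promise<string>, // hypothetical: fetches the conversation and returns its title
  sleep: (ms: number) => Promise<void>, // hypothetical: resolves after ms milliseconds
  delayMs = 2000,
): Promise<string | null> {
  if (!titleJobQueued) return null;
  await sleep(delayMs);
  return getTitle();
}
```

In an app, `sleep` could be `(ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))`.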
About tier and backend
Both optional. If omitted, defaults are:
- tier: inherited from the previous turn (or slow for a new conversation)
- backend: router's choice; it picks the least-loaded family that serves the model
When you pass backend, you scope the routing to a hardware family (e.g. mac-mini = all Mac mini Ollama backends; the router still load-balances within that family).
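A sketch of a request-body builder that leaves `tier` and `backend` out when they are not set, so the server defaults described above apply. The helper names are hypothetical. `model` is passed on every call here because the request example includes it; whether the server requires it is not stated.

```typescript
const TIERS = ["slow", "medium", "fast", "ludicrous"] as const;
type Tier = (typeof TIERS)[number];

interface RoutingOptions {
  tier?: Tier;
  backend?: string; // hardware family name, e.g. "mac-mini" or "spark"
}

// Type guard for tier values arriving as free-form strings (e.g. from user settings).
function isTier(value: string): value is Tier {
  return (TIERS as readonly string[]).includes(value);
}

// Builds the JSON body for POST /chat/sessions/{id}/messages.
// Unset routing options are omitted entirely rather than sent as null.
function buildMessageBody(content: string, model: string, opts: RoutingOptions = {}): string {
  const body: Record<string, string> = { content, model };
  if (opts.tier) body.tier = opts.tier;
  if (opts.backend) body.backend = opts.backend;
  return JSON.stringify(body);
}
```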
Stop generation
POST /stop/{id}
This endpoint lives on the same domain as the chat webapp (chat.siati.ai/stop/...), purposely outside the main API surface, so it can be called fire-and-forget from a separate connection while a long-running messages POST is still streaming.
The server writes a Redis flag chat:stop:{id} that the streaming loop polls on every chunk. The model finishes the current chunk and then stops, and the assistant message is saved with whatever was generated up to that point, suffixed with \n\n_⏹ Generazione interrotta dall'utente._.
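A client that reloads a conversation may want to recognise interrupted messages by the suffix described above. The sketch below does that and also builds the stop URL. The `https` scheme and the helper names are assumptions; which identifier `{id}` refers to and whether the call needs an Authorization header are not stated here.

```typescript
// Suffix the server appends to an interrupted assistant message (from the text above).
const STOP_SUFFIX = "\n\n_⏹ Generazione interrotta dall'utente._";

// URL of the stop endpoint. The https scheme is assumed; the source gives host and path only.
function stopUrl(id: string): string {
  return `https://chat.siati.ai/stop/${encodeURIComponent(id)}`;
}

// Splits a saved assistant message into its generated text and a "stopped" flag.
function splitStopped(content: string): { text: string; stopped: boolean } {
  return content.endsWith(STOP_SUFFIX)
    ? { text: content.slice(0, -STOP_SUFFIX.length), stopped: true }
    : { text: content, stopped: false };
}
```

Stripping the suffix lets the UI render its own localized "stopped" indicator instead of the embedded Italian text.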
Full client example (TypeScript)
const BASE = 'https://my.siati.ai/api/v1';
// `jwt` and the callbacks updateUI, saveMessage, updateTitle and showError come from your app.
const headers = {
  'Authorization': `Bearer ${jwt}`,
  'Content-Type': 'application/json',
  'Accept': 'application/json',
};

// 1) Create a session
const conv = await fetch(`${BASE}/chat/sessions`, {
  method: 'POST', headers,
  body: JSON.stringify({ model: 'apertus-70b-instruct' }),
}).then(r => r.json());

// 2) Send a message and consume the SSE stream
const resp = await fetch(`${BASE}/chat/sessions/${conv.id}/messages`, {
  method: 'POST',
  headers: { ...headers, 'Accept': 'text/event-stream' },
  body: JSON.stringify({
    content: 'Spiegami il teorema di Pitagora con un esempio.',
    model: 'apertus-70b-instruct',
    tier: 'fast',
  }),
});
// Failures before the stream starts (401, 404, 422, 429, 503) arrive as plain HTTP errors
if (!resp.ok || !resp.body) throw new Error(`Send failed: HTTP ${resp.status}`);

const reader = resp.body.getReader();
const decoder = new TextDecoder();
let buf = '';
let assistantText = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });

  // SSE frames: 'data: {…}\n\n'. A chunk may end mid-frame, so keep the tail in buf.
  let frameEnd;
  while ((frameEnd = buf.indexOf('\n\n')) !== -1) {
    const frame = buf.slice(0, frameEnd);
    buf = buf.slice(frameEnd + 2);
    if (!frame.startsWith('data: ')) continue;

    const ev = JSON.parse(frame.slice(6));
    if (ev.type === 'delta') {
      assistantText += ev.text;
      updateUI(assistantText);
    }
    if (ev.type === 'done') {
      saveMessage(conv.id, assistantText, ev);
      if (ev.title_job_queued) {
        // Poll once after ~2s to pick up the auto-title
        setTimeout(async () => {
          const fresh = await fetch(`${BASE}/chat/sessions/${conv.id}`, { headers }).then(r => r.json());
          updateTitle(fresh.title);
        }, 2000);
      }
    }
    if (ev.type === 'error') {
      showError(ev.message);
    }
  }
}
Errors
| HTTP | When |
|---|---|
| 401 | JWT missing/expired/invalid |
| 404 | Conversation not found or not owned by you |
| 422 | Validation failed (e.g. content empty or too long) |
| 429 | Rate limit hit for your plan/tier (see Rate limits) |
| 503 | No backend healthy for the requested (model, tier) |
Errors during SSE streaming are emitted as data: {"type":"error", …} followed by data: {"type":"done", …, "outcome":"error"}. The HTTP status is still 200 (the response started successfully).
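One way a client might react to these statuses is sketched below. The action names and the mapping are suggestions made up for illustration, not part of the API; only the status codes and the `outcome` field come from this page.

```typescript
// Hypothetical client reactions; these names are not defined by the API.
type ClientAction =
  | "proceed"
  | "reauthenticate"
  | "drop_conversation"
  | "fix_input"
  | "back_off"
  | "retry_later"
  | "report";

// Maps an HTTP status from the table above to a suggested client reaction.
function actionForStatus(status: number): ClientAction {
  if (status >= 200 && status < 300) return "proceed";
  switch (status) {
    case 401: return "reauthenticate"; // JWT missing, expired or invalid
    case 404: return "drop_conversation"; // not found or not owned by the user
    case 422: return "fix_input"; // validation failed
    case 429: return "back_off"; // rate limit hit
    case 503: return "retry_later"; // no healthy backend for (model, tier)
    default: return "report";
  }
}

// A stream that started with HTTP 200 can still fail: check the done event's outcome.
function streamSucceeded(done: { outcome: string }): boolean {
  return done.outcome === "ok";
}
```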
Differences vs developer API
| | Developer API (api.siati.ai) | Sessions API (my.siati.ai/api/v1) |
|---|---|---|
| Auth | Bearer API key | Bearer JWT |
| State | Stateless | Server-side conversations |
| Format | OpenAI-compatible | Custom, simplified |
| Streaming | OpenAI SSE chunks | {type:delta,text} / {type:done,…} |
| Auto-title | No | Yes, via background job |
| Stop-on-demand | No (close connection) | POST /stop/{id} |
| History management | You manage | Server manages |
| Use case | Backend integration, OpenAI SDK migration | Mobile apps, web chat UIs |
For a chat app, use this sessions API. For backend code that already speaks OpenAI, use the developer API.