Streaming#
Add stream=True to receive tokens as they're generated, via Server-Sent
Events.
Python#
stream = client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=[{"role": "user", "content": "Write a Swiss-themed haiku"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
TypeScript#
const stream = await client.chat.completions.create({
  model: "siati/llama-3.1-405b",
  messages: [{ role: "user", content: "Write a Swiss-themed haiku" }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
console.log();
curl#
curl -N https://api.siati.ai/v1/chat/completions \
  -H "Authorization: Bearer siati_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "siati/llama-3.1-405b",
    "messages": [{"role": "user", "content": "Write a Swiss-themed haiku"}],
    "stream": true
  }'
-N disables curl buffering so you see chunks in real time.
Chunk format#
Each chunk is a data: <json>\n\n line:
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Under "}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Swiss "}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"peaks..."}}]}
data: [DONE]
The last event is the literal string [DONE] (not JSON). The official SDKs handle the SSE
parsing for you.
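If you are not using an SDK, you can parse the event stream yourself. Below is a minimal sketch in Python; it assumes httpx as the HTTP client (any client with streaming support works) and reuses the endpoint and payload from the curl example above.

import json

import httpx  # assumption: any streaming-capable HTTP client would do

headers = {"Authorization": "Bearer siati_..."}
payload = {
    "model": "siati/llama-3.1-405b",
    "messages": [{"role": "user", "content": "Write a Swiss-themed haiku"}],
    "stream": True,
}

with httpx.stream(
    "POST",
    "https://api.siati.ai/v1/chat/completions",
    headers=headers,
    json=payload,
    timeout=None,
) as response:
    for line in response.iter_lines():
        # Each SSE event arrives as a "data: <json>" line; blank lines separate events.
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # literal sentinel, not JSON
            break
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
print()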
Operational notes#
- The default ingress timeout is 600 s (10 minutes). For long-running generations (articles, multi-page translations), consider splitting the work into multiple calls.
- Streaming does not reduce cost; you still pay for every generated token. It only reduces perceived time to first token.
- For chat UIs, use stream=true (see the sketch after this list). For batch pipelines or tool calling, leave stream=false (the default).
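For the chat-UI case, a common pattern is to render deltas as they arrive while also accumulating the full reply for the conversation history. A minimal sketch, assuming the same client as the Python example above:

messages = [{"role": "user", "content": "Write a Swiss-themed haiku"}]

stream = client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=messages,
    stream=True,
)

reply_parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render incrementally in the UI
        reply_parts.append(delta)
print()

# Keep the assembled reply so the next turn has the full context.
messages.append({"role": "assistant", "content": "".join(reply_parts)})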