Streaming#
Add stream=True to receive tokens as they're generated, via Server-Sent
Events.
Python#
stream = client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=[{"role": "user", "content": "Write a Swiss-themed haiku"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
TypeScript#
const stream = await client.chat.completions.create({
  model: "siati/llama-3.1-405b",
  messages: [{ role: "user", content: "Write a Swiss-themed haiku" }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
console.log();
curl#
curl -N https://api.siati.ai/v1/chat/completions \
  -H "Authorization: Bearer siati_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "siati/llama-3.1-405b",
    "messages": [{"role": "user", "content": "Write a Swiss-themed haiku"}],
    "stream": true
  }'
-N disables curl buffering so you see chunks in real time.
Chunk format#
Each chunk is a data: <json>\n\n line:
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Under "}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Swiss "}}]}
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"peaks..."}}]}
data: [DONE]
The last event is the literal string [DONE] (not JSON). The official SDKs handle the SSE
parsing for you.
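If you are not using an SDK, you can parse the event stream yourself. Below is a minimal sketch in Python; it assumes httpx as the HTTP client (any client with streaming support works) and reuses the endpoint and payload from the curl example above.

import json

import httpx  # assumption: any streaming-capable HTTP client would do

headers = {"Authorization": "Bearer siati_..."}
payload = {
    "model": "siati/llama-3.1-405b",
    "messages": [{"role": "user", "content": "Write a Swiss-themed haiku"}],
    "stream": True,
}

with httpx.stream(
    "POST",
    "https://api.siati.ai/v1/chat/completions",
    headers=headers,
    json=payload,
    timeout=None,
) as response:
    for line in response.iter_lines():
        # Each SSE event arrives as a "data: <json>" line; blank lines separate events.
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":  # literal sentinel, not JSON
            break
        chunk = json.loads(data)
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
print()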
Operational notes#
- The default ingress timeout is 600 s (10 minutes). For long-running generations (articles, multi-page translations), consider splitting the work into multiple calls.
- Streaming does not reduce cost; you still pay for every generated token. It only reduces perceived time to first token.
- For chat UIs, use stream=true (see the sketch after this list). For batch pipelines or tool calling, leave stream=false (the default).
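For the chat-UI case, a common pattern is to render deltas as they arrive while also accumulating the full reply for the conversation history. A minimal sketch, assuming the same client as the Python example above:

messages = [{"role": "user", "content": "Write a Swiss-themed haiku"}]

stream = client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=messages,
    stream=True,
)

reply_parts = []
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render incrementally in the UI
        reply_parts.append(delta)
print()

# Keep the assembled reply so the next turn has the full context.
messages.append({"role": "assistant", "content": "".join(reply_parts)})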