Chat completions#

POST /v1/chat/completions — primary endpoint, 100% OpenAI-compatible.

Basic shape#

```python
from openai import OpenAI

# Any OpenAI-compatible client works; point base_url at your endpoint.
client = OpenAI(base_url="...", api_key="...")

client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=[
        {"role": "system", "content": "You are a Swiss assistant."},
        {"role": "user", "content": "Explain nFADP in 3 bullets."},
    ],
    temperature=0.7,
    max_tokens=500,
)
```

Supported parameters#

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `model` | string | required | See model catalog. |
| `messages` | array | required | Optional system message first, then alternating user/assistant turns. |
| `temperature` | float | 1.0 | 0 = deterministic, 2 = very creative. |
| `top_p` | float | 1.0 | Nucleus sampling; alternative to `temperature`. |
| `max_tokens` | int | model default | Output cap. If omitted, the model's default limit applies. |
| `stream` | bool | false | See streaming. |
| `tools` | array | none | Function calling; see tool use. |
| `response_format` | object | none | `{"type": "json_object"}` for JSON output. |
| `seed` | int | none | Best-effort reproducibility. |
| `stop` | string \| array | none | Stop sequences. |
| `presence_penalty` | float | 0 | Range -2 to +2. |
| `frequency_penalty` | float | 0 | Range -2 to +2. |
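Several of these parameters combine in one request body. A minimal sketch of building the raw JSON payload for `POST /v1/chat/completions` (the `build_payload` helper and its filtering are illustrative, not part of any SDK):

```python
import json

# Hypothetical helper: assembles the JSON body for POST /v1/chat/completions,
# keeping only parameters from the supported list above.
def build_payload(model, messages, **params):
    allowed = {"temperature", "top_p", "max_tokens", "stream", "tools",
               "response_format", "seed", "stop",
               "presence_penalty", "frequency_penalty"}
    body = {"model": model, "messages": messages}
    body.update({k: v for k, v in params.items() if k in allowed})
    return body

payload = build_payload(
    "siati/llama-3.1-405b",
    [{"role": "user", "content": "List 3 cantons as JSON."}],
    response_format={"type": "json_object"},
    seed=42,
    stop=["\n\n"],
)
print(json.dumps(payload, indent=2))
```

Parameters you omit are simply absent from the body, so the model's defaults apply.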

Multi-turn conversation#

```python
history = [{"role": "system", "content": "You are a Swiss legal assistant."}]

while True:
    user_input = input("> ")
    if not user_input:  # empty line ends the session
        break
    history.append({"role": "user", "content": user_input})
    resp = client.chat.completions.create(
        model="siati/mistral-small-24b",
        messages=history,  # resend the full conversation every turn
    )
    answer = resp.choices[0].message.content
    print(answer)
    history.append({"role": "assistant", "content": answer})  # keep the reply for the next turn
```

The model has no memory across calls: you must resend the full history each turn. Cost scales accordingly, since you pay for every input token in that history on each round.
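One common way to bound that cost is to resend only the system prompt plus the most recent turns. A minimal sketch (the `trim_history` helper and its message budget are illustrative, not part of the API):

```python
# Hypothetical cost-control helper: keep the system prompt plus only the
# most recent turns so the resent history stays under a message budget.
def trim_history(history, max_messages=20):
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    keep = max_messages - len(system)
    if keep <= 0:
        return system
    return system + rest[-keep:]

history = [{"role": "system", "content": "You are a Swiss legal assistant."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=6)
print(len(trimmed))  # 6: the system prompt plus the 5 most recent messages
```

A token-based budget (counting with a tokenizer instead of message counts) is more precise, but the idea is the same.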

Response shape#

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1714723200,
  "model": "siati/llama-3.1-405b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 100,
    "total_tokens": 142
  }
}
```

`finish_reason` can be:

- `stop`: the model ended naturally.
- `length`: the output hit `max_tokens`.
- `tool_calls`: the model requested a function call.
- `content_filter`: rare; returned only for severe violations.
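Callers should branch on this field rather than assume a complete answer. A minimal sketch over the raw JSON shape above (the `describe_finish` helper and its messages are illustrative, not part of the API):

```python
# Hypothetical dispatch on finish_reason from a raw response choice.
def describe_finish(choice):
    reason = choice["finish_reason"]
    if reason == "length":
        return "truncated: raise max_tokens or continue the generation"
    if reason == "tool_calls":
        return "model requested a function call"
    if reason == "content_filter":
        return "blocked by content filter"
    return "completed normally"

choice = {
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "length",
}
print(describe_finish(choice))
```

The `length` case is the one worth handling everywhere: a truncated reply often looks plausible but is missing its ending.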