Chat completions#

POST /v1/chat/completions — primary endpoint, 100% OpenAI-compatible.

Basic shape#

```python
from openai import OpenAI

# Any OpenAI-compatible client works; point base_url at your endpoint.
client = OpenAI(base_url="...", api_key="...")

client.chat.completions.create(
    model="siati/llama-3.1-405b",
    messages=[
        {"role": "system", "content": "You are a Swiss assistant."},
        {"role": "user", "content": "Explain nFADP in 3 bullets."},
    ],
    temperature=0.7,
    max_tokens=500,
)
```

Supported parameters#

| Parameter | Type | Default | Notes |
|---|---|---|---|
| `model` | string | required | See model catalog. |
| `messages` | array | required | Optional system message first, then alternating user/assistant turns. |
| `temperature` | float | 1.0 | 0 = deterministic, 2 = very creative. |
| `top_p` | float | 1.0 | Nucleus sampling; alternative to `temperature`. |
| `max_tokens` | int | model default | Output cap. If omitted, the model's default limit applies. |
| `stream` | bool | false | See streaming. |
| `tools` | array | none | Function calling; see tool use. |
| `response_format` | object | none | `{"type": "json_object"}` for JSON output. |
| `seed` | int | none | Best-effort reproducibility. |
| `stop` | string \| array | none | Stop sequences. |
| `presence_penalty` | float | 0 | Range -2 to +2. |
| `frequency_penalty` | float | 0 | Range -2 to +2. |
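Several of these parameters combine in one request body. A minimal sketch of building the raw JSON payload for `POST /v1/chat/completions` (the `build_payload` helper and its filtering are illustrative, not part of any SDK):

```python
import json

# Hypothetical helper: assembles the JSON body for POST /v1/chat/completions,
# keeping only parameters from the supported list above.
def build_payload(model, messages, **params):
    allowed = {"temperature", "top_p", "max_tokens", "stream", "tools",
               "response_format", "seed", "stop",
               "presence_penalty", "frequency_penalty"}
    body = {"model": model, "messages": messages}
    body.update({k: v for k, v in params.items() if k in allowed})
    return body

payload = build_payload(
    "siati/llama-3.1-405b",
    [{"role": "user", "content": "List 3 cantons as JSON."}],
    response_format={"type": "json_object"},
    seed=42,
    stop=["\n\n"],
)
print(json.dumps(payload, indent=2))
```

Parameters you omit are simply absent from the body, so the model's defaults apply.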

Multi-turn conversation#

```python
history = [{"role": "system", "content": "You are a Swiss legal assistant."}]

while True:
    user_input = input("> ")
    if not user_input:  # empty line ends the session
        break
    history.append({"role": "user", "content": user_input})
    resp = client.chat.completions.create(
        model="siati/mistral-small-24b",
        messages=history,  # resend the full conversation every turn
    )
    answer = resp.choices[0].message.content
    print(answer)
    history.append({"role": "assistant", "content": answer})  # keep the reply for the next turn
```

The model has no memory across calls: you must resend the full history each turn. Cost scales accordingly, since you pay for every input token in that history on each round.
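One common way to bound that cost is to resend only the system prompt plus the most recent turns. A minimal sketch (the `trim_history` helper and its message budget are illustrative, not part of the API):

```python
# Hypothetical cost-control helper: keep the system prompt plus only the
# most recent turns so the resent history stays under a message budget.
def trim_history(history, max_messages=20):
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    keep = max_messages - len(system)
    if keep <= 0:
        return system
    return system + rest[-keep:]

history = [{"role": "system", "content": "You are a Swiss legal assistant."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=6)
print(len(trimmed))  # 6: the system prompt plus the 5 most recent messages
```

A token-based budget (counting with a tokenizer instead of message counts) is more precise, but the idea is the same.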

Response shape#

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1714723200,
  "model": "siati/llama-3.1-405b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 42,
    "completion_tokens": 100,
    "total_tokens": 142
  }
}
```

`finish_reason` can be:

- `stop`: the model ended naturally.
- `length`: the output hit `max_tokens`.
- `tool_calls`: the model requested a function call.
- `content_filter`: rare; returned only for severe violations.
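Callers should branch on this field rather than assume a complete answer. A minimal sketch over the raw JSON shape above (the `describe_finish` helper and its messages are illustrative, not part of the API):

```python
# Hypothetical dispatch on finish_reason from a raw response choice.
def describe_finish(choice):
    reason = choice["finish_reason"]
    if reason == "length":
        return "truncated: raise max_tokens or continue the generation"
    if reason == "tool_calls":
        return "model requested a function call"
    if reason == "content_filter":
        return "blocked by content filter"
    return "completed normally"

choice = {
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "length",
}
print(describe_finish(choice))
```

The `length` case is the one worth handling everywhere: a truncated reply often looks plausible but is missing its ending.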