siati.ai docs

Concepts

Models

How we pick what to serve, what \"open-weight\" means here, and which model fits which job.

Last updated: 2026-05-24

Models

We serve open-weight large language models on our own hardware in Lugano. No closed APIs from a third party, no opaque billing model. You can audit what's running.

The line-up

For the live, always-current list, go to Models catalog. The shortlist as of today:

Model model_id Use case Origin Why we run it
Apertus 70B Instruct apertus-70b-instruct Default general-purpose, multilingua 🇨🇭 Swiss AI Initiative (EPFL + ETH + CSCS) Trained in Switzerland on Alps supercomputer. Coherent with the sovereign positioning.
Mistral Large 2 (123B) mistral-large-2 Top reasoning, code, multilingua incl. italiano 🇫🇷 Mistral AI European model, eccellente quality + privacy posture.
Qwen 2.5 72B Instruct qwen2.5-72b-instruct Best dense ~70B open: code, math, multilingua 🇨🇳 Alibaba State-of-the-art dense 72B at the time of inclusion.
DeepSeek-R1 Distill 70B deepseek-r1-distill-70b Reasoning with visible chain-of-thought (<think>…</think>) 🇨🇳 DeepSeek (distilled into Llama-70B) Unique transparency on reasoning steps. Great for math, logic, debugging.
Qwen 2.5 7B Instruct (Q4_K_M) qwen2.5:7b-instruct-q4_K_M Fast drafts, low-cost 🇨🇳 Alibaba Compact, multilingua incl. italiano. Runs on Apple Silicon.
Qwen 2.5 1.5B qwen2.5:1.5b Background jobs, title generation 🇨🇳 Alibaba Tiny but capable. Used for auto-titles, edge use cases.
BGE-M3 (embeddings) bge-m3 RAG, semantic search (1024-dim, multilingua) 🇨🇳 BAAI Standard for multilingual embeddings, used in our RAG stack.
BGE-reranker-v2-m3 bge-reranker-v2-m3 RAG 2-stage rerank (cross-encoder) 🇨🇳 BAAI +15-25% recall@5 vs dense-only.

How they map to hardware

Hardware Models served Quant Notes
L40B (1× L40S 46 GB) Apertus 70B W4A16 ~55 tok/s
DGX Spark (GB10 128 GB unified) Apertus 70B W4A16 ~6 tok/s — memory-bandwidth bound, edge / fallback only
BigGuy GPU 0+1 (2× RTX 6000 Pro Blackwell, TP=2) Mistral Large 2 AWQ INT4 ~50-70 tok/s
BigGuy GPU 2 (1× RTX 6000 Pro Blackwell) Qwen 2.5 72B AWQ INT4 ~80-120 tok/s
BigGuy GPU 3 (1× RTX 6000 Pro Blackwell) DeepSeek-R1 Distill 70B AWQ INT4 ~60-80 tok/s
L40A (1× L40S 46 GB) BGE-M3 + BGE-reranker FP16 TEI runtime
2× Mac mini + VM Qwen 2.5 7B + 1.5B Q4_K_M (Ollama) ~30-40 tok/s

The router (BackendRouter) picks the right backend based on (model_id, tier, queue_depth, gpu_pressure). Transparent to you.

How we choose what to run

Three criteria:

  1. License: usable commercially without surprises. Apache 2.0, MIT, Llama 3 license, Apertus license, Mistral Research/Commercial license, etc.
  2. Quality: top of the class for its size category at the time of inclusion.
  3. Hardware fit: must run efficiently on the GPUs we have. We don't add a model if it would degrade throughput across the fleet.

We avoid models that:

  • Phone home on inference (rare in open-weights but always possible)
  • Require gated access we can't audit
  • Embed user-data harvesting in their tooling

Quantization

Most models are served quantized to fit our memory budget and squeeze more throughput:

  • W4A16 (weights 4-bit, activations 16-bit) — our default for vLLM dense models. Roughly 4× smaller than FP16, minimal quality loss.
  • AWQ INT4 — used for Mistral Large 2, Qwen 72B, DeepSeek-R1 Distill on Blackwell.
  • FP8 — supported on Blackwell, used selectively.
  • Q4_K_M (GGUF) — Ollama format for Qwen models on Mac mini.
  • FP16 — only for small models or RAG components (BGE-M3, reranker).

Quantization is invisible to the API caller: the model_id is the same regardless of which quant is loaded.

How to choose

Rough guideline:

Need Recommended model
Italian chat for users (default) apertus-70b-instruct (Swiss) or mistral-large-2 (top quality EU)
Math, logic, code with visible reasoning deepseek-r1-distill-70b
Code generation, complex reasoning qwen2.5-72b-instruct or mistral-large-2
Fast/cheap drafts, classification, title generation qwen2.5:7b-instruct-q4_K_M
Background jobs, minimal latency qwen2.5:1.5b
RAG embeddings bge-m3 (used automatically by /v1/embeddings)

A request always specifies a model_id. The router figures out where to run it.

Tier × Model

Not every model is available at every tier. Tiers are about hardware priority:

  • ludicrous → reserved Blackwell slot (Mistral / Qwen 72B / DeepSeek-R1)
  • fast → Blackwell or L40S production
  • medium → L40S / DGX Spark
  • slow → Apple Silicon / CPU VM

Pick the cheapest tier that meets your latency requirement. See Tiers for the full pricing matrix.

What if we don't have the model you need?

Tell us. Adding a model is mechanical (download → register backend → expose in /v1/models) and we do it on request for enterprise customers — turnaround typically same-day if the model is open-weight and fits our hardware. For deeper customization (fine-tuning your own variant on your own data), the Personalization product is coming Q2 2026.