Concepts
Models
How we pick what to serve, what \"open-weight\" means here, and which model fits which job.
Last updated: 2026-05-24
Models
We serve open-weight large language models on our own hardware in Lugano. No closed APIs from a third party, no opaque billing model. You can audit what's running.
The line-up
For the live, always-current list, go to Models catalog. The shortlist as of today:
| Model | model_id |
Use case | Origin | Why we run it |
|---|---|---|---|---|
| Apertus 70B Instruct | apertus-70b-instruct |
Default general-purpose, multilingua | 🇨🇭 Swiss AI Initiative (EPFL + ETH + CSCS) | Trained in Switzerland on Alps supercomputer. Coherent with the sovereign positioning. |
| Mistral Large 2 (123B) | mistral-large-2 |
Top reasoning, code, multilingua incl. italiano | 🇫🇷 Mistral AI | European model, eccellente quality + privacy posture. |
| Qwen 2.5 72B Instruct | qwen2.5-72b-instruct |
Best dense ~70B open: code, math, multilingua | 🇨🇳 Alibaba | State-of-the-art dense 72B at the time of inclusion. |
| DeepSeek-R1 Distill 70B | deepseek-r1-distill-70b |
Reasoning with visible chain-of-thought (<think>…</think>) |
🇨🇳 DeepSeek (distilled into Llama-70B) | Unique transparency on reasoning steps. Great for math, logic, debugging. |
| Qwen 2.5 7B Instruct (Q4_K_M) | qwen2.5:7b-instruct-q4_K_M |
Fast drafts, low-cost | 🇨🇳 Alibaba | Compact, multilingua incl. italiano. Runs on Apple Silicon. |
| Qwen 2.5 1.5B | qwen2.5:1.5b |
Background jobs, title generation | 🇨🇳 Alibaba | Tiny but capable. Used for auto-titles, edge use cases. |
| BGE-M3 (embeddings) | bge-m3 |
RAG, semantic search (1024-dim, multilingua) | 🇨🇳 BAAI | Standard for multilingual embeddings, used in our RAG stack. |
| BGE-reranker-v2-m3 | bge-reranker-v2-m3 |
RAG 2-stage rerank (cross-encoder) | 🇨🇳 BAAI | +15-25% recall@5 vs dense-only. |
How they map to hardware
| Hardware | Models served | Quant | Notes |
|---|---|---|---|
| L40B (1× L40S 46 GB) | Apertus 70B | W4A16 | ~55 tok/s |
| DGX Spark (GB10 128 GB unified) | Apertus 70B | W4A16 | ~6 tok/s — memory-bandwidth bound, edge / fallback only |
| BigGuy GPU 0+1 (2× RTX 6000 Pro Blackwell, TP=2) | Mistral Large 2 | AWQ INT4 | ~50-70 tok/s |
| BigGuy GPU 2 (1× RTX 6000 Pro Blackwell) | Qwen 2.5 72B | AWQ INT4 | ~80-120 tok/s |
| BigGuy GPU 3 (1× RTX 6000 Pro Blackwell) | DeepSeek-R1 Distill 70B | AWQ INT4 | ~60-80 tok/s |
| L40A (1× L40S 46 GB) | BGE-M3 + BGE-reranker | FP16 | TEI runtime |
| 2× Mac mini + VM | Qwen 2.5 7B + 1.5B | Q4_K_M (Ollama) | ~30-40 tok/s |
The router (BackendRouter) picks the right backend based on (model_id, tier, queue_depth, gpu_pressure). Transparent to you.
How we choose what to run
Three criteria:
- License: usable commercially without surprises. Apache 2.0, MIT, Llama 3 license, Apertus license, Mistral Research/Commercial license, etc.
- Quality: top of the class for its size category at the time of inclusion.
- Hardware fit: must run efficiently on the GPUs we have. We don't add a model if it would degrade throughput across the fleet.
We avoid models that:
- Phone home on inference (rare in open-weights but always possible)
- Require gated access we can't audit
- Embed user-data harvesting in their tooling
Quantization
Most models are served quantized to fit our memory budget and squeeze more throughput:
- W4A16 (weights 4-bit, activations 16-bit) — our default for vLLM dense models. Roughly 4× smaller than FP16, minimal quality loss.
- AWQ INT4 — used for Mistral Large 2, Qwen 72B, DeepSeek-R1 Distill on Blackwell.
- FP8 — supported on Blackwell, used selectively.
- Q4_K_M (GGUF) — Ollama format for Qwen models on Mac mini.
- FP16 — only for small models or RAG components (BGE-M3, reranker).
Quantization is invisible to the API caller: the model_id is the same regardless of which quant is loaded.
How to choose
Rough guideline:
| Need | Recommended model |
|---|---|
| Italian chat for users (default) | apertus-70b-instruct (Swiss) or mistral-large-2 (top quality EU) |
| Math, logic, code with visible reasoning | deepseek-r1-distill-70b |
| Code generation, complex reasoning | qwen2.5-72b-instruct or mistral-large-2 |
| Fast/cheap drafts, classification, title generation | qwen2.5:7b-instruct-q4_K_M |
| Background jobs, minimal latency | qwen2.5:1.5b |
| RAG embeddings | bge-m3 (used automatically by /v1/embeddings) |
A request always specifies a model_id. The router figures out where to run it.
Tier × Model
Not every model is available at every tier. Tiers are about hardware priority:
ludicrous→ reserved Blackwell slot (Mistral / Qwen 72B / DeepSeek-R1)fast→ Blackwell or L40S productionmedium→ L40S / DGX Sparkslow→ Apple Silicon / CPU VM
Pick the cheapest tier that meets your latency requirement. See Tiers for the full pricing matrix.
What if we don't have the model you need?
Tell us. Adding a model is mechanical (download → register backend → expose in /v1/models) and we do it on request for enterprise customers — turnaround typically same-day if the model is open-weight and fits our hardware. For deeper customization (fine-tuning your own variant on your own data), the Personalization product is coming Q2 2026.