Concepts

Models

How we pick what to serve, what \"open-weight\" means here, and which model fits which job.

Last updated: 2026-05-24

Models

We serve open-weight large language models on our own hardware in Lugano. No closed APIs from a third party, no opaque billing model. You can audit what's running.

The line-up

For the live, always-current list, go to Models catalog. The shortlist as of today:

Model	`model_id`	Use case	Origin	Why we run it
Apertus 70B Instruct	`apertus-70b-instruct`	Default general-purpose, multilingua	🇨🇭 Swiss AI Initiative (EPFL + ETH + CSCS)	Trained in Switzerland on Alps supercomputer. Coherent with the sovereign positioning.
Mistral Large 2 (123B)	`mistral-large-2`	Top reasoning, code, multilingua incl. italiano	🇫🇷 Mistral AI	European model, eccellente quality + privacy posture.
Qwen 2.5 72B Instruct	`qwen2.5-72b-instruct`	Best dense ~70B open: code, math, multilingua	🇨🇳 Alibaba	State-of-the-art dense 72B at the time of inclusion.
DeepSeek-R1 Distill 70B	`deepseek-r1-distill-70b`	Reasoning with visible chain-of-thought (`<think>…</think>`)	🇨🇳 DeepSeek (distilled into Llama-70B)	Unique transparency on reasoning steps. Great for math, logic, debugging.
Qwen 2.5 7B Instruct (Q4_K_M)	`qwen2.5:7b-instruct-q4_K_M`	Fast drafts, low-cost	🇨🇳 Alibaba	Compact, multilingua incl. italiano. Runs on Apple Silicon.
Qwen 2.5 1.5B	`qwen2.5:1.5b`	Background jobs, title generation	🇨🇳 Alibaba	Tiny but capable. Used for auto-titles, edge use cases.
BGE-M3 (embeddings)	`bge-m3`	RAG, semantic search (1024-dim, multilingua)	🇨🇳 BAAI	Standard for multilingual embeddings, used in our RAG stack.
BGE-reranker-v2-m3	`bge-reranker-v2-m3`	RAG 2-stage rerank (cross-encoder)	🇨🇳 BAAI	+15-25% recall@5 vs dense-only.

How they map to hardware

Hardware	Models served	Quant	Notes
L40B (1× L40S 46 GB)	Apertus 70B	W4A16	~55 tok/s
DGX Spark (GB10 128 GB unified)	Apertus 70B	W4A16	~6 tok/s — memory-bandwidth bound, edge / fallback only
BigGuy GPU 0+1 (2× RTX 6000 Pro Blackwell, TP=2)	Mistral Large 2	AWQ INT4	~50-70 tok/s
BigGuy GPU 2 (1× RTX 6000 Pro Blackwell)	Qwen 2.5 72B	AWQ INT4	~80-120 tok/s
BigGuy GPU 3 (1× RTX 6000 Pro Blackwell)	DeepSeek-R1 Distill 70B	AWQ INT4	~60-80 tok/s
L40A (1× L40S 46 GB)	BGE-M3 + BGE-reranker	FP16	TEI runtime
2× Mac mini + VM	Qwen 2.5 7B + 1.5B	Q4_K_M (Ollama)	~30-40 tok/s

The router (BackendRouter) picks the right backend based on (model_id, tier, queue_depth, gpu_pressure). Transparent to you.

How we choose what to run

Three criteria:

License: usable commercially without surprises. Apache 2.0, MIT, Llama 3 license, Apertus license, Mistral Research/Commercial license, etc.
Quality: top of the class for its size category at the time of inclusion.
Hardware fit: must run efficiently on the GPUs we have. We don't add a model if it would degrade throughput across the fleet.

We avoid models that:

Phone home on inference (rare in open-weights but always possible)
Require gated access we can't audit
Embed user-data harvesting in their tooling

Quantization

Most models are served quantized to fit our memory budget and squeeze more throughput:

W4A16 (weights 4-bit, activations 16-bit) — our default for vLLM dense models. Roughly 4× smaller than FP16, minimal quality loss.
AWQ INT4 — used for Mistral Large 2, Qwen 72B, DeepSeek-R1 Distill on Blackwell.
FP8 — supported on Blackwell, used selectively.
Q4_K_M (GGUF) — Ollama format for Qwen models on Mac mini.
FP16 — only for small models or RAG components (BGE-M3, reranker).

Quantization is invisible to the API caller: the model_id is the same regardless of which quant is loaded.

How to choose

Rough guideline:

Need	Recommended model
Italian chat for users (default)	`apertus-70b-instruct` (Swiss) or `mistral-large-2` (top quality EU)
Math, logic, code with visible reasoning	`deepseek-r1-distill-70b`
Code generation, complex reasoning	`qwen2.5-72b-instruct` or `mistral-large-2`
Fast/cheap drafts, classification, title generation	`qwen2.5:7b-instruct-q4_K_M`
Background jobs, minimal latency	`qwen2.5:1.5b`
RAG embeddings	`bge-m3` (used automatically by `/v1/embeddings`)

A request always specifies a model_id. The router figures out where to run it.

Tier × Model

Not every model is available at every tier. Tiers are about hardware priority:

ludicrous → reserved Blackwell slot (Mistral / Qwen 72B / DeepSeek-R1)
fast → Blackwell or L40S production
medium → L40S / DGX Spark
slow → Apple Silicon / CPU VM

Pick the cheapest tier that meets your latency requirement. See Tiers for the full pricing matrix.

What if we don't have the model you need?

Tell us. Adding a model is mechanical (download → register backend → expose in /v1/models) and we do it on request for enterprise customers — turnaround typically same-day if the model is open-weight and fits our hardware. For deeper customization (fine-tuning your own variant on your own data), the Personalization product is coming Q2 2026.

Models

The line-up#

How they map to hardware#

How we choose what to run#

Quantization#

How to choose#

Tier × Model#

What if we don't have the model you need?#

The line-up

How they map to hardware

How we choose what to run

Quantization

How to choose

Tier × Model

What if we don't have the model you need?