Live data

Infrastructure (fleet)

Live list of all serving backends — LLMs, embeddings, reranker, vector store.

Polled every 30 s by the BackendRouter health check.

Backends total

Active

Healthy

GPUs (active)

VRAM (active)

176 GB

LLM serving

Backends that run chat completion models.

Name	Kind	GPU / VRAM	Models	Tier weights	Status
gpu-amd-r9700 AMD Radeon AI PRO R9700 32GB (RDNA4, ROCm 7.2.4) - Lugano. Modelli 14B-32B.	`ollama`	1 × 32 GB	`qwen2.5:32b` `qwen2.5:14b`	s:50 m:80 f:100 l:50	● healthy queue: 0
gpu-apertus-l40s NVIDIA L40S 46GB Ada Lovelace 350W. 145 tok/s @ conc=8. Best per fast/medium.	`vllm`	1 × 48 GB	`apertus-70b-instruct`	s:100 m:200 f:500 l:50	● healthy queue: 0
gpu-gemma-l40s RTX 5090 32GB (Blackwell). Gemma 4 26B-A4B FP8-Dynamic, kv-cache fp8. ~54 tok/s.	`vllm`	1 × 48 GB	`gemma-4-26b`	s:40 m:80 f:100 l:80	● healthy queue: 0
inference-vm-61 VM CPU-only test/fallback. Solo qwen2.5:1.5b come Modellino.	`ollama`	CPU only	`qwen2.5:1.5b`	s:50 m:0 f:0 l:0	● healthy queue: 0
mac-mini-13 Apple Silicon M4 16GB. Ottimo per title-generation e modelli piccoli (qwen 1.5B / 7B Q4).	`ollama`	1 × 16 GB	`qwen2.5:1.5b` `qwen2.5:7b-instruct-q4_K_M`	s:200 m:80 f:80 l:0	○ pending
mac-mini-14 Apple Silicon M2 Pro 16GB (Lugano). qwen 1.5B/7B.	`ollama`	1 × 16 GB	`qwen2.5:1.5b` `qwen2.5:7b-instruct-q4_K_M`	s:200 m:80 f:80 l:0	○ pending
mac-mini-6 Apple Silicon M2 Pro 16GB (Lugano). qwen 1.5B/7B.	`ollama`	1 × 16 GB	`qwen2.5:1.5b` `qwen2.5:7b-instruct-q4_K_M`	s:200 m:80 f:80 l:0	○ pending
mac-mini-7 Apple Silicon M4 16GB. Ottimo per title-generation e modelli piccoli (qwen 1.5B / 7B Q4).	`ollama`	1 × 16 GB	`qwen2.5:1.5b` `qwen2.5:7b-instruct-q4_K_M`	s:200 m:80 f:80 l:0	○ pending

Embeddings

Vector embedding model servers.

Name	Kind	GPU / VRAM	Role		Status
gpu-svc-embeddings NVIDIA L40S, BAAI/bge-m3 multilingual embeddings (1024-dim) via TEI. Used by /v1/embedding...	`tei`	1 × 24 GB	`bge-m3`		● healthy

Reranker

Cross-encoder reranker servers for two-stage retrieval.

Name	Kind	GPU / VRAM	Role		Status
gpu-svc-reranker BGE-reranker-v2-m3 cross-encoder. Re-ranks top-30 → top-5 in RAG pipeline. Multilingual, +...	`reranker`	1 × 24 GB	`bge-reranker-v2-m3`		● healthy

Routing rules

The router computes weight = tier_weights[user_tier] for each LLM backend serving the requested model. Selection: highest weight first, then queue_depth ASC, then gpu_pressure ASC. Backends with weight = 0 are excluded for that tier.

> [!TIP] > To force a specific tier per request, send the X-Siati-Tier header. The router still respects model availability — if no backend serves your model at the requested tier, you get 503.

Services not routed

Embeddings, reranker and vector store backends are not dispatched by the BackendRouter — they're called directly by the relevant services (RAG pipeline, /v1/embeddings endpoint). They appear here for observability.