siati.ai docs

Live data

Infrastructure (fleet)

Live list of all serving backends — LLMs, embeddings, reranker, vector store.

Polled every 30 s by the BackendRouter health check.

Backends total

11

Active

11

Healthy

11

GPUs (active)

6

VRAM (active)

515 GB

LLM serving

Backends that run chat completion models.

Name Kind GPU / VRAM Models Tier weights Status
bigguy-deepseek-r1
BigGuy GPU3 — DeepSeek-R1-Distill-Llama-70B AWQ INT4 (reasoning con chain-of-thought)
vllm 1 ×
97 GB
deepseek-r1-distill-70b
s:0 m:0 f:100 l:100 ● healthy
queue: 0
bigguy-mistral
BigGuy GPU0+1 — Mistral Large 2 (123B) AWQ TP=2 — francese, multilingua, top reasoning
vllm 2 ×
194 GB
mistral-large-2
s:0 m:0 f:100 l:100 ● healthy
queue: 0
bigguy-qwen72b
BigGuy GPU2 — Qwen 2.5 72B Instruct AWQ INT4 su 1× RTX 6000 Pro Blackwell
vllm 1 ×
97 GB
qwen2.5-72b-instruct
s:0 m:0 f:100 l:100 ● healthy
queue: 0
inference-vm-61
VM CPU-only test/fallback. Solo qwen2.5:1.5b come Modellino.
ollama CPU only qwen2.5:1.5b
s:50 m:0 f:0 l:0 ● healthy
queue: 0
l40-b-apertus
NVIDIA L40S 46GB Ada Lovelace 350W. 145 tok/s @ conc=8. Best per fast/medium.
vllm 1 ×
46 GB
apertus-70b-instruct
s:100 m:200 f:500 l:50 ● healthy
queue: 0
mac-mini-13
Apple Silicon M4 16GB. Ottimo per title-generation e modelli piccoli (qwen 1.5B / 7B Q4).
ollama CPU only qwen2.5:1.5b
qwen2.5:7b-instruct-q4_K_M
s:200 m:80 f:0 l:0 ● healthy
queue: 0
mac-mini-7
Apple Silicon M4 16GB. Ottimo per title-generation e modelli piccoli (qwen 1.5B / 7B Q4).
ollama CPU only qwen2.5:1.5b
qwen2.5:7b-instruct-q4_K_M
s:200 m:80 f:0 l:0 ● healthy
queue: 0

Embeddings

Vector embedding model servers.

Name Kind GPU / VRAM Role Status
l40-a-embeddings
NVIDIA L40S, BAAI/bge-m3 multilingual embeddings (1024-dim) via TEI. Used by /v1/embedding...
tei 1 ×
46 GB
bge-m3 ● healthy

Reranker

Cross-encoder reranker servers for two-stage retrieval.

Name Kind GPU / VRAM Role Status
l40-a-reranker
BGE-reranker-v2-m3 cross-encoder. Re-ranks top-30 → top-5 in RAG pipeline. Multilingual, +...
reranker CPU only bge-reranker-v2-m3 ● healthy

Vector store

Persistent vector database for RAG knowledge bases.

Name Kind GPU / VRAM Role Status
bigguy-qdrant
Qdrant 1.18 vector database. Cosine distance, 1024-dim. Backs all RAG knowledge bases.
qdrant CPU only ● healthy

Document parsing

Layout-aware document parser (PDF tables, OCR, multi-column reading order).

Name Kind GPU / VRAM Role Status
bigguy-docling
IBM Docling: layout-aware PDF parsing with table extraction, multi-column reading order, O...
docling CPU only ● healthy

Routing rules

The router computes weight = tier_weights[user_tier] for each LLM backend serving the requested model. Selection: highest weight first, then queue_depth ASC, then gpu_pressure ASC. Backends with weight = 0 are excluded for that tier.

> [!TIP] > To force a specific tier per request, send the X-Siati-Tier header. The router still respects model availability — if no backend serves your model at the requested tier, you get 503.

Services not routed

Embeddings, reranker and vector store backends are not dispatched by the BackendRouter — they're called directly by the relevant services (RAG pipeline, /v1/embeddings endpoint). They appear here for observability.