Live data
Infrastructure (fleet)
Live list of all serving backends — LLMs, embeddings, reranker, vector store.
Polled every 30 s by the BackendRouter health check.
| Backends total | Active | Healthy | GPUs (active) | VRAM (active) |
|---|---|---|---|---|
| 11 | 11 | 11 | 6 | 515 GB |
LLM serving
Backends that run chat completion models.
| Name | Description | Kind | GPU / VRAM | Models | Tier weights | Status | Queue |
|---|---|---|---|---|---|---|---|
| bigguy-deepseek-r1 | BigGuy GPU3: DeepSeek-R1-Distill-Llama-70B AWQ INT4 (reasoning with chain-of-thought) | vllm | 1 × 97 GB | deepseek-r1-distill-70b | s:0 m:0 f:100 l:100 | ● healthy | 0 |
| bigguy-mistral | BigGuy GPU0+1: Mistral Large 2 (123B) AWQ, TP=2. French, multilingual, top reasoning | vllm | 2 × 194 GB | mistral-large-2 | s:0 m:0 f:100 l:100 | ● healthy | 0 |
| bigguy-qwen72b | BigGuy GPU2: Qwen 2.5 72B Instruct AWQ INT4 on 1× RTX 6000 Pro Blackwell | vllm | 1 × 97 GB | qwen2.5-72b-instruct | s:0 m:0 f:100 l:100 | ● healthy | 0 |
| inference-vm-61 | CPU-only VM for testing and fallback. Serves only qwen2.5:1.5b as Modellino. | ollama | CPU only | qwen2.5:1.5b | s:50 m:0 f:0 l:0 | ● healthy | 0 |
| l40-b-apertus | NVIDIA L40S 46GB Ada Lovelace 350W. 145 tok/s @ conc=8. Best for fast/medium. | vllm | 1 × 46 GB | apertus-70b-instruct | s:100 m:200 f:500 l:50 | ● healthy | 0 |
| mac-mini-13 | Apple Silicon M4 16GB. Great for title generation and small models (qwen 1.5B / 7B Q4). | ollama | CPU only | qwen2.5:1.5b, qwen2.5:7b-instruct-q4_K_M | s:200 m:80 f:0 l:0 | ● healthy | 0 |
| mac-mini-7 | Apple Silicon M4 16GB. Great for title generation and small models (qwen 1.5B / 7B Q4). | ollama | CPU only | qwen2.5:1.5b, qwen2.5:7b-instruct-q4_K_M | s:200 m:80 f:0 l:0 | ● healthy | 0 |
Embeddings
Vector embedding model servers.
| Name | Description | Kind | GPU / VRAM | Role | Status |
|---|---|---|---|---|---|
| l40-a-embeddings | NVIDIA L40S, BAAI/bge-m3 multilingual embeddings (1024-dim) via TEI. Used by /v1/embedding... | tei | 1 × 46 GB | bge-m3 | ● healthy |
Reranker
Cross-encoder reranker servers for two-stage retrieval.
| Name | Description | Kind | GPU / VRAM | Role | Status |
|---|---|---|---|---|---|
| l40-a-reranker | BGE-reranker-v2-m3 cross-encoder. Re-ranks top-30 → top-5 in RAG pipeline. Multilingual, +... | reranker | CPU only | bge-reranker-v2-m3 | ● healthy |
Vector store
Persistent vector database for RAG knowledge bases.
| Name | Description | Kind | GPU / VRAM | Role | Status |
|---|---|---|---|---|---|
| bigguy-qdrant | Qdrant 1.18 vector database. Cosine distance, 1024-dim. Backs all RAG knowledge bases. | qdrant | CPU only | | ● healthy |
Document parsing
Layout-aware document parser (PDF tables, OCR, multi-column reading order).
| Name | Description | Kind | GPU / VRAM | Role | Status |
|---|---|---|---|---|---|
| bigguy-docling | IBM Docling: layout-aware PDF parsing with table extraction, multi-column reading order, O... | docling | CPU only | | ● healthy |
Routing rules
The router computes weight = tier_weights[user_tier] for each LLM backend serving the requested model. Selection: highest weight first, then queue_depth ASC, then gpu_pressure ASC. Backends with weight = 0 are excluded for that tier.
X-Siati-Tier header. The router still respects model availability — if no backend serves your model at the requested tier, you get 503.
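The selection order described above can be sketched in Python. This is an illustration under stated assumptions, not the BackendRouter's actual code: the `Backend` class, the `select_backend` function, and the representation of `gpu_pressure` as a float are all hypothetical, and the queue depths in the sample fleet are made up to show the tie-break. Tier weights are copied from the LLM serving table.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    """Hypothetical record for one LLM backend (not the router's real type)."""
    name: str
    models: list[str]
    tier_weights: dict[str, int]   # keys "s", "m", "f", "l", as in the table
    queue_depth: int = 0
    gpu_pressure: float = 0.0      # assumed numeric load figure; lower is better


def select_backend(backends: list[Backend], model: str, user_tier: str) -> Optional[Backend]:
    """Pick a backend: highest weight, then lowest queue depth, then lowest GPU pressure."""
    candidates = [
        b for b in backends
        if model in b.models and b.tier_weights.get(user_tier, 0) > 0  # weight 0 = excluded
    ]
    if not candidates:
        return None  # the caller would answer with HTTP 503
    return min(
        candidates,
        key=lambda b: (-b.tier_weights[user_tier], b.queue_depth, b.gpu_pressure),
    )


# Sample fleet: weights from the table, queue depths invented for the example.
fleet = [
    Backend("inference-vm-61", ["qwen2.5:1.5b"], {"s": 50, "m": 0, "f": 0, "l": 0}),
    Backend("mac-mini-13", ["qwen2.5:1.5b", "qwen2.5:7b-instruct-q4_K_M"],
            {"s": 200, "m": 80, "f": 0, "l": 0}, queue_depth=2),
    Backend("mac-mini-7", ["qwen2.5:1.5b", "qwen2.5:7b-instruct-q4_K_M"],
            {"s": 200, "m": 80, "f": 0, "l": 0}, queue_depth=0),
]

chosen = select_backend(fleet, "qwen2.5:1.5b", "s")
print(chosen.name if chosen else "503")
```

In this sample, both Mac minis share the top weight for tier `s`, so the shorter queue decides. Requesting the same model at tier `f` yields no candidate, because every backend serving it has weight 0 for that tier, which corresponds to the 503 case.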
Services not routed
Embeddings, reranker and vector store backends are not dispatched by the BackendRouter — they're called directly by the relevant services (RAG pipeline, /v1/embeddings endpoint). They appear here for observability.