## Architecture
┌─────────────────────────────────────────────────────────────┐
│ Client SDK (Python / TS / curl) — base_url=api.siati.ai     │
└─────────────────────────────────────────────────────────────┘
                              │ TLS 1.3
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ NGINX Ingress — siati.ai / api.siati.ai / chat.siati.ai     │
│ Cert: Let's Encrypt prod (auto-renew via cert-manager)      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ FastAPI backend (Python)                                    │
│ Auth → rate-limit → quota check → Registry → Inference      │
└─────────────────────────────────────────────────────────────┘
                              │
                ┌─────────────┼─────────────┬──────────────┐
                ▼             ▼             ▼              ▼
           ┌────────┐    ┌─────────┐   ┌────────┐   ┌────────────┐
           │ vLLM   │    │ vLLM    │   │ vLLM   │   │ Ollama     │
           │ 405B   │    │ Mistral │   │ bge-m3 │   │ Mac fleet  │
           │ B6000×4│    │ RTX5090 │   │ L4     │   │ M2 Pro × 2 │
           └────────┘    └─────────┘   └────────┘   └────────────┘
                              │
                              ▼
              All in Swiss datacenter (Lodrino)
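Every request takes the same path regardless of which box ultimately serves it. Here is a minimal client-side sketch, assuming the API is OpenAI-compatible (the format vLLM serves natively); the model id and API key are illustrative, not real:

```python
# A minimal sketch, assuming an OpenAI-compatible endpoint at
# api.siati.ai (vLLM speaks this format natively). Model id and
# key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.siati.ai/v1",  # TLS 1.3 straight to the Swiss ingress
    api_key="sk-...",                    # placeholder, not a real key
)

resp = client.chat.completions.create(
    model="llama-3.1-405b",  # hypothetical model id from the registry
    messages=[{"role": "user", "content": "Grüezi!"}],
)
print(resp.choices[0].message.content)
```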
## Stack
- Inference: open-source vLLM 0.11 on NVIDIA GPUs we own, plus Ollama on Apple Silicon (M-series) for the low-power slow tier
- Orchestration: Kubernetes RKE2 v1.31, Cilium CNI, kube-vip for an HA control plane, ingress-nginx for L7 routing
- Models: open-weight models from HuggingFace (Llama 3.1, Mistral, Qwen, BGE), with a 3.5 TB local cache on Micron 7450 NVMe
- Backend: FastAPI (Python 3.12), async SQLAlchemy, Redis for rate limits and sessions, Postgres 16 for persistence (a sketch of the request path follows this list)
- Frontend: Next.js 15 App Router, next-intl with 5 languages
- Payments: Stripe (sk_live, i.e. live mode), with automatic Swiss tax handling
- Hardware: NVIDIA Blackwell B6000-Pro (96 GB) × 4, RTX 5090 (32 GB), L40S (48 GB), L4 (24 GB)
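The Backend bullet compresses the Auth → rate-limit → quota check → Registry → Inference chain from the diagram, which maps naturally onto FastAPI dependencies. A condensed sketch follows; the upstream hostnames, the 60 req/min limit, and the registry contents are assumptions, and the real key and quota checks run against Postgres 16 via async SQLAlchemy:

```python
# Condensed sketch of the backend request path. Upstream hostnames,
# the rate limit, and the registry contents are assumptions.
import httpx
import redis.asyncio as redis
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
rdb = redis.Redis()  # rate-limit counters live in Redis

# Hypothetical registry: model id → upstream inference server.
REGISTRY = {
    "llama-3.1-405b": "http://vllm-405b:8000",     # vLLM, B6000 × 4
    "mistral":        "http://vllm-mistral:8000",  # vLLM, RTX 5090
    "bge-m3":         "http://vllm-bge:8000",      # vLLM, L4
    "slow-tier":      "http://ollama-mac:11434",   # Ollama, Mac fleet
}

async def auth(authorization: str = Header(...)) -> str:
    # Stub: the real check validates the key against Postgres.
    if not authorization.startswith("Bearer "):
        raise HTTPException(401, "invalid API key")
    return authorization.removeprefix("Bearer ")

async def rate_limit(key: str = Depends(auth)) -> str:
    # Fixed-window counter: at most 60 requests per minute per key.
    n = await rdb.incr(f"rl:{key}")
    if n == 1:
        await rdb.expire(f"rl:{key}", 60)
    if n > 60:
        raise HTTPException(429, "rate limit exceeded")
    return key

async def quota(key: str = Depends(rate_limit)) -> str:
    # Stub: the real check debits the key's monthly quota in Postgres.
    return key

@app.post("/v1/chat/completions")
async def completions(payload: dict, key: str = Depends(quota)):
    upstream = REGISTRY.get(payload.get("model", ""))
    if upstream is None:
        raise HTTPException(404, "unknown model")
    # Forward to the chosen inference box; vLLM and Ollama both speak
    # the OpenAI chat-completions format.
    async with httpx.AsyncClient(timeout=300) as client:
        r = await client.post(f"{upstream}/v1/chat/completions", json=payload)
    return r.json()
```

Running each stage as a FastAPI dependency keeps the hot path one async call deep and makes auth, rate limiting, and quota independently testable.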
## Nothing in the data path
- No US hyperscalers (no AWS/Azure/GCP)
- No third-party inference (no OpenAI, Anthropic, or Together APIs)
- No logging proxies (no Cloudflare Workers, Fly.io, Vercel Edge)
- No third-party telemetry (no Datadog, Sentry, New Relic)
Just SIATI staff under Swiss employment contract, on hardware we own, in a Swiss datacenter.