Architecture#

┌─────────────────────────────────────────────────────────────┐
│  Client SDK (Python / TS / curl) — base_url=api.siati.ai   │
└─────────────────────────────────────────────────────────────┘
                          │ TLS 1.3
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  NGINX Ingress — siati.ai / api.siati.ai / chat.siati.ai   │
│  Cert: Let's Encrypt prod (auto-renew via cert-manager)    │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  FastAPI backend (Python)                                   │
│  Auth → rate-limit → quota check → Registry → Inference    │
└─────────────────────────────────────────────────────────────┘
                          │
            ┌─────────────┼─────────────┬──────────────┐
            ▼             ▼             ▼              ▼
        ┌────────┐   ┌─────────┐   ┌────────┐   ┌────────────┐
        │ vLLM   │   │ vLLM    │   │ vLLM   │   │ Ollama     │
        │ 405B   │   │ Mistral │   │ bge-m3 │   │ Mac fleet  │
        │ B6000×4│   │ RTX5090 │   │ L4     │   │ M2 Pro × 2 │
        └────────┘   └─────────┘   └────────┘   └────────────┘
                          │
                          ▼
                  All in Swiss datacenter (Lodrino)
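The backend box in the diagram chains auth → rate-limit → quota check → registry lookup → inference. A minimal Python sketch of that ordering, with hypothetical function and field names standing in for the real FastAPI internals:

```python
# Hypothetical sketch of the backend request path:
# auth -> rate limit -> quota -> model registry -> inference.
# All names here are illustrative, not the real SIATI code.

def handle_request(api_key: str, model: str, prompt: str,
                   keys: dict, usage: dict, registry: dict) -> dict:
    # 1. Auth: the API key must map to a known user.
    user = keys.get(api_key)
    if user is None:
        return {"status": 401, "error": "invalid API key"}
    # 2. Rate limit: reject when the per-minute counter is exhausted.
    if user["requests_this_minute"] >= user["rate_limit"]:
        return {"status": 429, "error": "rate limit exceeded"}
    # 3. Quota: reject when the monthly token budget is spent.
    if usage.get(user["id"], 0) >= user["monthly_quota"]:
        return {"status": 402, "error": "quota exhausted"}
    # 4. Registry: map the model name to an inference backend.
    backend = registry.get(model)
    if backend is None:
        return {"status": 404, "error": f"unknown model {model!r}"}
    # 5. Inference: forward to the chosen backend (stubbed here).
    user["requests_this_minute"] += 1
    return {"status": 200, "backend": backend,
            "output": f"<completion for {prompt!r}>"}
```

A request that passes every check ends up routed to one of the vLLM or Ollama backends in the diagram; any failed check short-circuits before inference is touched.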

Stack#

  • Inference: open-source vLLM 0.11 on NVIDIA GPUs we own + Ollama on Apple Silicon (M-series) for the low-power slow tier
  • Orchestration: Kubernetes RKE2 v1.31, Cilium CNI, kube-vip for HA control plane, ingress-nginx for L7
  • Models: open-weight from HuggingFace (Llama 3.1, Mistral, Qwen, BGE) + local cache 3.5 TB on Micron 7450 NVMe
  • Backend: FastAPI (Python 3.12), async SQLAlchemy, Redis for rate limits and sessions, Postgres 16 for persistence
  • Frontend: Next.js 15 App Router, next-intl with 5 languages
  • Payments: Stripe (live mode), automatic Swiss tax handling
  • Hardware: NVIDIA Blackwell B6000-Pro (96 GB) × 4, RTX 5090 (32 GB), L40S, L4 (24 GB)
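The Redis-backed rate limiting in the stack above could follow a fixed-window counter pattern (one counter per key per time window, as Redis `INCR` + `EXPIRE` would give). A minimal sketch, with a plain dict standing in for Redis and the key scheme and window size as assumptions:

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter; a dict stands in for Redis INCR/EXPIRE."""

    def __init__(self, limit: int, window_s: int = 60, clock=time.time):
        self.limit = limit          # max requests per window
        self.window_s = window_s    # window length in seconds
        self.clock = clock          # injectable for testing
        self.counters: dict[str, int] = {}  # in Redis: INCR with an EXPIRE

    def allow(self, api_key: str) -> bool:
        # Bucket the current time into a window index; a new window
        # starts a fresh counter (Redis would expire the old key).
        window = int(self.clock()) // self.window_s
        key = f"ratelimit:{api_key}:{window}"  # hypothetical key scheme
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.limit
```

Fixed windows are simple and cheap (one counter increment per request); a sliding-window or token-bucket variant smooths the burst allowed at window boundaries, at the cost of slightly more state.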

Nothing in the data path#

  • No US hyperscalers (no AWS/Azure/GCP)
  • No third-party inference (no OpenAI, Anthropic, Together via API)
  • No logging proxy (no Cloudflare Workers, Fly, Vercel Edge)
  • No third-party telemetry (no Datadog, Sentry, New Relic)

Just SIATI staff under Swiss employment contract, on hardware we own, in a Swiss datacenter.