Changelog

What changed, when, and why. Reverse chronological. We try to keep this honest.

2026-05-24 — Multi-model BigGuy + chat UX polish

Catalogue expansion — three frontier open-weight models on BigGuy (4× RTX 6000 Pro Blackwell):

Mistral Large 2 (123B, AWQ INT4, TP=2 on GPU 0+1) — French-built; strong at reasoning, code, and multilingual work, including Italian. European model for the "EU sovereign" narrative.
Qwen 2.5 72B Instruct (AWQ INT4, single GPU 2) — the best dense ~70B open-weight model at time of inclusion. Throughput on Blackwell: ~80-120 tok/s.
DeepSeek-R1 Distill Llama-70B (AWQ INT4, single GPU 3) — reasoning model with visible chain-of-thought (<think>…</think> blocks). Excellent for math/logic/debugging — wow factor for demos.
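
Clients that want to show the reasoning separately from the final answer need to split the <think>…</think> blocks out of the response. A minimal sketch in Python, assuming the tags arrive well-formed in the completed text; the function name split_reasoning is hypothetical and not part of the platform:

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completed model response."""
    # Collect the contents of every think block, in order.
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    # Whatever remains after removing the blocks is the visible answer.
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(reasoning)
print(answer)
```

During streaming the closing tag may not have arrived yet, so a real client would also need to handle an unterminated block; this sketch only covers the finished response.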

Llama 3.1 405B AWQ removed: in testing, the AWQ INT4 quantization of the 405B model degraded quality noticeably (artefacts on instruction-following tasks). Running three smaller but stronger models on the same hardware also gives better cluster utilization.

DGX Spark benchmark confirms Apertus 70B runs at ~6 tok/s vs ~55 tok/s on a single L40S — Spark is memory-bandwidth-bound (LPDDR5X). Kept as fallback / edge serving only.

Chat UX overhaul on chat.siati.ai:

Token-by-token streaming via Livewire 3 $this->stream() + curl_multi non-blocking in VllmBackend::chatStream() / HttpBackend::streamFromCurl(). Previous "pseudo-streaming" (buffer everything then flush) replaced with true SSE.
Live markdown + KaTeX rendering during streaming: marked + marked-katex-extension + DOMPurify. Math formulas ($$…$$, \[…\], $…$) render as you read. Heuristic JS preprocessor recovers [ formula ] patterns that some models use instead of LaTeX delimiters.
Autoscroll with manual lock: scroll follows the generation; wheel-up / touch-up / PageUp from the user pauses autoscroll for 3 seconds; returning within 50px of the bottom resumes it. Critical CSS fix: min-h-0 on the messages container so flex overflow actually activates.
Stop button: a separate POST /stop/{conv_id} endpoint (outside Livewire's serialized action queue) writes a Redis flag; the streaming loop polls it on every chunk and breaks. Partial response is saved.
Auto-tier max: selecting a model from the dropdown snaps the tier to the fastest available for that model (ludicrous > fast > medium > slow). No more manual click.
Per-message metrics stored and displayed: TTFT, total latency, computed tok/s. Persisted in chat_messages.ttft_ms + latency_ms for benchmark queries.
Auto-title background job (GenerateConversationTitle) uses Qwen 2.5 7B to generate a concise title from the first user message. Mobile clients can detect title_job_queued: true in the done SSE event and refetch ~2s later.
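
The stop flag and the per-message metrics meet in the same streaming loop. The sketch below illustrates that interaction in Python rather than in the production PHP code. Everything in it is an assumption made for illustration: the function name consume_stream, the stop_requested callback standing in for the Redis flag lookup, the treatment of one chunk as one token, and tok/s measured over the window after the first chunk.

```python
import time
from typing import Callable, Iterable


def consume_stream(
    chunks: Iterable[str],
    stop_requested: Callable[[], bool],
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, dict]:
    """Drain a chunk stream, honouring a stop flag, and compute metrics."""
    start = clock()
    first_at = None
    parts: list[str] = []
    for chunk in chunks:
        # Poll the stop flag on every chunk; keep what was already received.
        if stop_requested():
            break
        if first_at is None:
            first_at = clock()  # time of the first forwarded chunk
        parts.append(chunk)
    end = clock()

    ttft_ms = None if first_at is None else round((first_at - start) * 1000)
    latency_ms = round((end - start) * 1000)
    # Assumption: throughput is measured after the first chunk arrives.
    decode_s = 0.0 if first_at is None else end - first_at
    tok_per_s = len(parts) / decode_s if decode_s > 0 else None
    metrics = {"ttft_ms": ttft_ms, "latency_ms": latency_ms, "tok_per_s": tok_per_s}
    return "".join(parts), metrics


text, metrics = consume_stream(iter(["Hel", "lo"]), lambda: False)
print(text, metrics["latency_ms"])
```

Passing the clock in as a parameter keeps the timing logic testable without real delays.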

Mobile API expanded: POST /api/v1/chat/sessions/{id}/messages now accepts optional backend (hardware family hint) and emits SSE chunks identical to the chat webapp. Full doc at API → Chat sessions.
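
For client authors, a rough sketch of reading that stream and deciding whether to refetch the title. It assumes standard SSE framing (event: and data: fields, with a blank line ending each event) and a JSON payload on the done event; the event name chunk and the payload shape beyond title_job_queued are hypothetical, so check the API documentation for the real contract.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # Per the SSE format, one leading space after the colon is dropped.
            data.append(line[len("data:"):].removeprefix(" "))


def should_refetch_title(lines: Iterable[str]) -> bool:
    """True if the done event says a title job was queued."""
    for event, data in parse_sse(lines):
        if event == "done":
            return bool(json.loads(data).get("title_job_queued"))
    return False


sample = [
    "event: chunk", 'data: {"text": "Hi"}', "",
    "event: done", 'data: {"title_job_queued": true}', "",
]
print(should_refetch_title(sample))
```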

Outbound mail migrated to the dedicated SMTP relay relay01.siati.net:587 with SASL auth (siati-app@siati.ai). Anti-spoofing is enforced relay-side. Passed the audit and rDNS check.

Hostname rename: ciccione (4× RTX 6000 Pro Blackwell host) renamed to bigguy everywhere — code, DB, wiki, slide deck. The previous name was a remnant of dev humour, not fit for client-facing materials.

2026-05-20 — RAG retrieval quality upgrade (reranker + Docling + hybrid)

Three pieces, each tackling a different weakness of vanilla dense RAG:

Reranker — BAAI/bge-reranker-v2-m3 deployed via TEI on the L40A GPU (http://…:8081/rerank). The pipeline now oversamples 30 candidates from Qdrant, then a cross-encoder re-scores them jointly with the query and returns the best 5. Measurable +15–25% recall@5 on standard benchmarks; tested on internal QA pairs with similar uplift.
Docling — IBM Docling Serve container deployed on the BigGuy GPU host (http://…:8090/v1/convert/file). Replaces pdftotext as the primary PDF parser. Preserves tables (structured markdown), reading order on multi-column layouts, figure-caption pairing, and falls back to OCR on scanned pages. pdftotext retained as automatic fallback if Docling is unreachable.
Hybrid search — Qdrant collections recreated with both dense (BGE-M3) and sparse (BM25) vectors. Query path uses Reciprocal Rank Fusion (RRF) natively in Qdrant 1.18+. Captures exact-term matches (codes, names, acronyms) that dense embeddings tend to bury.
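
To make the hybrid and rerank steps concrete, here is a small pure-Python sketch of Reciprocal Rank Fusion followed by oversample-then-rerank. It is an illustration only: in the pipeline described above the fusion runs inside Qdrant and the scoring runs on the TEI reranker. The function names, the rerank_score callback, and the constant k=60 (a commonly used RRF default) are assumptions, not the deployed configuration.

```python
from typing import Callable, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(
    dense: Sequence[str],
    sparse: Sequence[str],
    rerank_score: Callable[[str], float],
    oversample: int = 30,
    top_k: int = 5,
) -> list[str]:
    """Fuse dense and sparse hits, keep `oversample`, rerank, return `top_k`."""
    candidates = rrf_fuse([dense, sparse])[:oversample]
    return sorted(candidates, key=rerank_score, reverse=True)[:top_k]


# Hypothetical hit lists and reranker scores, for illustration only.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fake_scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}
print(rrf_fuse([dense_hits, sparse_hits]))
print(retrieve(dense_hits, sparse_hits, fake_scores.get, top_k=2))
```

A document found by both retrievers accumulates two reciprocal-rank terms, which is why exact-term matches surfaced only by the sparse side can still reach the reranker.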

Architecture diagram and tunable knobs documented in Concepts → RAG.

Cleanup: all pre-upgrade KBs were wiped (Qdrant collections + Postgres rows + filesystem) since the old schema is not hybrid-compatible.

2026-05-19 — RAG goes live + fleet expansion + docs rebuild

RAG (Retrieval-Augmented Generation)

End-to-end RAG service launched in dashboard and API.
Vector store: Qdrant on Blackwell host, persistent on dedicated NVMe.
Embeddings: BGE-M3 (multilingual, 1024-dim) served on dedicated L40S GPU.
LLM: Apertus 70B by default; any catalog model selectable per request.
UI: /dashboard/rag for non-developers — drag-and-drop PDF/DOCX/MD/TXT, ask questions, cited answers.
API: POST /api/v1/rag/kb, POST /kb/{slug}/docs, POST /kb/{slug}/chat.
Supported formats: PDF (via pdftotext), DOCX, Markdown, plain text.
End-to-end latency: ~6 s for a 5-chunk retrieval + 100-token answer on Apertus.

Fleet

L40A — second L40S online for embeddings (BGE-M3 hosting). 100 GbE.
L40B — first Apertus serving node (vLLM, W4A16 quant). 145 tok/s @ conc=8.
DGX Spark — Grace+Blackwell ARM, second Apertus serving node. 49 tok/s @ conc=8, ~3× lower TDP than L40S — best for slow tier.
BigGuy — 4× RTX 6000 Pro Blackwell, downloading DeepSeek-V3-AWQ (327 GB).
Mac mini × 2 — Ollama serving Qwen 2.5 1.5b + 7b. Used for slow tier and title generation.
PTP 100 GbE link between two GPU hosts benched at 58 Gbps TCP, MTU 9000, 0.2 ms latency — ready for cross-host pipeline-parallel.
MikroTik ROSE Data Server RDS2216 as gateway (ARM64 16 cores, 32 GB RAM, 2×100 GbE + 6×10 GbE).
Juniper QFX5200-32C-32Q as core L2 switch (32×100 GbE + 32×40 GbE).

Backend infrastructure

backends + backend_models tables added: pool-based fleet management with per-tier weights.
BackendRouter service routes requests by (tier, model) using weighted ranking + queue depth tiebreaker.
VllmBackend + HttpBackend (Ollama): two client implementations, swappable per backend kind.
InferenceHealthCheck cron polls every 30 s and updates is_healthy, queue_depth, gpu_pressure.
Filament admin at /admin/backends for live fleet management without redeploy.
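
The routing rule can be pictured with a short sketch. This is not the BackendRouter source: the Backend dataclass, its field names other than is_healthy and queue_depth, and the weight values are all hypothetical, and Python is used only for illustration. It assumes unhealthy backends are skipped, the highest tier weight wins, and queue depth breaks ties.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Backend:
    name: str
    models: set[str]
    tier_weights: dict[str, int] = field(default_factory=dict)
    is_healthy: bool = True
    queue_depth: int = 0


def pick_backend(backends: list[Backend], tier: str, model: str) -> Optional[Backend]:
    """Choose a backend for (tier, model): highest weight, then shortest queue."""
    eligible = [
        b for b in backends
        if b.is_healthy and model in b.models and b.tier_weights.get(tier, 0) > 0
    ]
    if not eligible:
        return None
    # Negate the weight so min() prefers the heaviest; queue depth breaks ties.
    return min(eligible, key=lambda b: (-b.tier_weights[tier], b.queue_depth))


fleet = [
    Backend("l40b", {"apertus-70b"}, {"fast": 10, "slow": 1}, queue_depth=4),
    Backend("spark", {"apertus-70b"}, {"slow": 10}, queue_depth=1),
]
chosen = pick_backend(fleet, "slow", "apertus-70b")
print(chosen.name if chosen else "no backend available")
```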

Public site & wiki

CDA-driven copy rewrites: hero, "Provalo, smetti di subire il CLOUD Act" ("Try it, stop being subject to the CLOUD Act") section, "Due modi di lavorare" ("Two ways of working"), trust strip; several stale sections removed.
Pricing UI: 4-card plan grid with a Monthly/Yearly toggle and the original price struck through when annual billing is selected.
Top-up selector: pill selector with 5 amounts (5/25/50/100/500 CHF), CHF → tokens estimate at slow tier, dynamic "Acquista token" ("Buy tokens") CTA.
This wiki: rebuilt from scratch — multi-page, markdown-backed, Cmd+K search, live data pages, copy buttons on code blocks, anchor headings, GH-style callouts.

Pricing model

Removed the deprecated embeddings tier, leaving 4 tiers: slow/medium/fast/ludicrous.
Removed surcharge_pct from tier_settings (a latent bug that was multiplying costs ×2; fixed and gone).
Moved Plan.price_chf_month and Plan.stripe_price_id into the Product table; products is now the single source of truth for prices.
Added is_highlighted flag on Plan (was hardcoded to pro in templates).
Added is_visible_on_home flag on ModelCard (home shows curated set, /models shows full).

2026-05-17 — Production platform live

siati.ai public site live at https://siati.ai.
Dashboard at https://my.siati.ai (registration, billing, playground, API keys).
Wiki at https://wiki.siati.ai (single-page initially — now superseded).
Mobile API contract /api/v1/* implemented (auth, /me/boot, chat sessions, IAP verify, billing v2).
Stripe integration (subscription checkout + topup), Apple/Google IAP verification endpoint.
Cookie consent + Google Analytics gated by consent.
Multi-domain Laravel app: siati.ai, my.siati.ai, api.siati.ai, wiki.siati.ai, chat.siati.ai.

2026-05-15 — Backend rebuild started

Started from scratch on Laravel 13 + Postgres 16 + Redis + Filament 4.
Migrated user accounts + API keys from legacy backend (gigia_) preserving Argon2id hashes.
Implemented BearerApiKey middleware (HMAC-SHA256 hashing with salt).
JWT auth for mobile API (firebase/php-jwt, 30-day TTL).
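
As a rough illustration of the API key scheme, the sketch below hashes a key with HMAC-SHA256 and checks a presented key in constant time. It is written in Python for brevity and is not the BearerApiKey middleware itself; the function names, the sample key, and the use of the salt as the HMAC key are assumptions.

```python
import hashlib
import hmac


def hash_api_key(api_key: str, salt: bytes) -> str:
    """Return the hex HMAC-SHA256 of the key; only this hash is stored."""
    return hmac.new(salt, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_api_key(presented: str, stored_hash: str, salt: bytes) -> bool:
    """Recompute the hash and compare in constant time to avoid timing leaks."""
    return hmac.compare_digest(hash_api_key(presented, salt), stored_hash)


# Hypothetical values, for illustration only.
salt = b"server-side-secret"
stored = hash_api_key("sk_example_123", salt)
print(verify_api_key("sk_example_123", stored, salt))
print(verify_api_key("sk_wrong", stored, salt))
```

Because the hash is deterministic for a given salt, the stored value can be indexed and looked up directly, which a slow per-row password hash such as Argon2id would not allow.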

Future entries will continue here. Anything material to the API or pricing is logged.
