Changelog
Releases and notable platform changes.
Last updated: 2026-05-24
What changed, when, and why. Reverse chronological. We try to keep this honest.
2026-05-24 — Multi-model BigGuy + chat UX polish
Catalogue expansion — three frontier open-weight models on BigGuy (4× RTX 6000 Pro Blackwell):
- Mistral Large 2 (123B, AWQ INT4, TP=2 on GPU 0+1) — French; top-tier reasoning, code, and multilingual performance, including Italian. The European model for the "EU sovereign" narrative.
- Qwen 2.5 72B Instruct (AWQ INT4, single GPU 2) — the best dense ~70B open-weight model at the time of inclusion. Throughput on Blackwell: ~80-120 tok/s.
- DeepSeek-R1 Distill Llama-70B (AWQ INT4, single GPU 3) — reasoning model with visible chain-of-thought (<think>…</think> blocks). Excellent for math/logic/debugging — wow factor for demos.
Llama 3.1 405B AWQ removed: tested but the AWQ INT4 quantization of 405B degrades quality noticeably (artefacts on instructed tasks). Three smaller-but-better models on the same hardware = better cluster utilization.
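As a rough sanity check on the placement above, the weight footprint of a 4-bit model can be estimated as parameters × bits / 8. The sketch below is back-of-the-envelope arithmetic only: it ignores quantization scales and zero points, KV cache, and activation memory, and the GPU count used for the 405B model is an assumption, not something this changelog states.

```python
def weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    return params_billions * bits_per_weight / 8.0


# (parameters in billions, number of GPUs the weights are sharded over)
models = {
    "Mistral Large 2": (123, 2),                # TP=2, per the entry above
    "Qwen 2.5 72B Instruct": (72, 1),
    "DeepSeek-R1 Distill Llama-70B": (70, 1),
    "Llama 3.1 405B": (405, 4),                 # GPU count is an assumption
}

for name, (params_b, gpus) in models.items():
    total = weight_gb(params_b)
    print(f"{name}: ~{total:.1f} GB weights, ~{total / gpus:.1f} GB per GPU")
```

Under these assumptions the three retained models need roughly 61.5, 36, and 35 GB of weights, against roughly 202.5 GB for the 405B model.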
DGX Spark benchmark confirms Apertus 70B runs at ~6 tok/s vs ~55 tok/s on a single L40S — Spark is memory-bandwidth-bound (LPDDR5X). Kept as fallback / edge serving only.
Chat UX overhaul on chat.siati.ai:
- Token-by-token streaming via Livewire 3 $this->stream() + curl_multi non-blocking in VllmBackend::chatStream() / HttpBackend::streamFromCurl(). Previous "pseudo-streaming" (buffer everything then flush) replaced with true SSE.
- Live markdown + KaTeX rendering during streaming: marked + marked-katex-extension + DOMPurify. Math formulas ($$…$$, \[…\], $…$) render as you read. Heuristic JS preprocessor recovers [ formula ] patterns that some models use instead of LaTeX delimiters.
- Autoscroll with manual lock: scroll follows the generation; wheel-up / touch-up / PageUp from the user pauses autoscroll for 3 seconds; returning within 50px of the bottom resumes it. Critical CSS fix: min-h-0 on the messages container so flex overflow actually activates.
- Stop button: a separate POST /stop/{conv_id} endpoint (outside Livewire's serialized action queue) writes a Redis flag; the streaming loop polls it on every chunk and breaks. Partial response is saved.
- Auto-tier max: selecting a model from the dropdown snaps the tier to the fastest available for that model (ludicrous > fast > medium > slow). No more manual click.
- Per-message metrics stored and displayed: TTFT, total latency, computed tok/s. Persisted in chat_messages.ttft_ms + latency_ms for benchmark queries.
- Auto-title background job (GenerateConversationTitle) uses Qwen 2.5 7B to generate a concise title from the first user message. Mobile clients can detect title_job_queued: true in the done SSE event and refetch ~2s later.
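The stop-button and per-message metrics items above can be illustrated with a minimal Python sketch of the loop shape. This is not the production PHP code: `stream_with_stop`, the in-memory `flags` dict standing in for Redis, and the `stop:42` key are all hypothetical, and the tok/s formula (one chunk counted as one token, measured over the whole request) is an assumption.

```python
import time
from typing import Callable, Iterable, List, Optional


def stream_with_stop(
    chunks: Iterable[str],
    stop_requested: Callable[[], bool],
    emit: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Forward chunks until the stream ends or the stop flag is seen."""
    started = clock()
    first_chunk_at: Optional[float] = None
    parts: List[str] = []
    stopped = False

    for chunk in chunks:
        # Poll the flag on every chunk; break as soon as it is set.
        if stop_requested():
            stopped = True
            break
        if first_chunk_at is None:
            first_chunk_at = clock()
        parts.append(chunk)
        emit(chunk)

    finished = clock()
    elapsed = finished - started
    return {
        "text": "".join(parts),  # the partial response survives a stop
        "stopped": stopped,
        "ttft_ms": None if first_chunk_at is None
        else round((first_chunk_at - started) * 1000),
        "latency_ms": round(elapsed * 1000),
        # Assumption: one chunk == one token, rate taken over the full request.
        "tok_per_s": len(parts) / elapsed if elapsed > 0 else 0.0,
    }


# Demo: the flag is raised after the second chunk has been sent.
flags = {}   # stand-in for Redis
sent = []


def emit(chunk: str) -> None:
    sent.append(chunk)
    if len(sent) == 2:
        flags["stop:42"] = True  # simulates a stop request arriving mid-stream


result = stream_with_stop(
    iter(["Hel", "lo", " wor", "ld"]),
    lambda: flags.get("stop:42", False),
    emit,
)
print(result["text"], result["stopped"])
```

Because the flag is checked before each chunk is forwarded, at most one already-generated chunk is discarded after the user presses stop.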
Mobile API expanded: POST /api/v1/chat/sessions/{id}/messages now accepts optional backend (hardware family hint) and emits SSE chunks identical to the chat webapp. Full doc at API → Chat sessions.
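For client authors, a minimal Python sketch of how the done event might be inspected is given below. The parser follows the generic text/event-stream framing (event: and data: fields, blank line as separator). The `chunk` event name and the JSON payload shapes are hypothetical; only the `done` event and the `title_job_queued` field come from the entry above.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from the lines of an SSE response body."""
    event, data = "message", []  # "message" is the default SSE event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def should_refetch_title(lines: Iterable[str]) -> bool:
    """True if the stream's done event reports a queued title job."""
    for event, data in parse_sse(lines):
        if event == "done":
            return bool(json.loads(data).get("title_job_queued", False))
    return False


# Hypothetical stream: payload fields other than title_job_queued are invented.
sample: List[str] = [
    "event: chunk", 'data: {"text": "Hi"}', "",
    "event: done", 'data: {"title_job_queued": true}', "",
]
print(should_refetch_title(sample))
```

A client that gets `True` here would wait about 2 s and then refetch the conversation to pick up the generated title.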
Outbound mail migrated to a dedicated SMTP relay, relay01.siati.net:587, with SASL auth (siati-app@siati.ai). Anti-spoofing is enforced relay-side. Passed the audit and rDNS check.
Hostname rename: ciccione (4× RTX 6000 Pro Blackwell host) renamed to bigguy everywhere — code, DB, wiki, slide deck. The previous name was a remnant of dev humour, not fit for client-facing materials.
2026-05-20 — RAG retrieval quality upgrade (reranker + Docling + hybrid)
Three pieces, each tackling a different weakness of vanilla dense RAG:
- Reranker — BAAI/bge-reranker-v2-m3 deployed via TEI on the L40A GPU (http://…:8081/rerank). The pipeline now oversamples 30 candidates from Qdrant, then a cross-encoder re-scores them jointly with the query and returns the best 5. Measurable +15–25% recall@5 on standard benchmarks; tested on internal QA pairs with similar uplift.
- Docling — IBM Docling Serve container deployed on the BigGuy GPU host (http://…:8090/v1/convert/file). Replaces pdftotext as the primary PDF parser. Preserves tables (structured markdown), reading order on multi-column layouts, figure-caption pairing, and falls back to OCR on scanned pages. pdftotext retained as automatic fallback if Docling is unreachable.
- Hybrid search — Qdrant collections recreated with both dense (BGE-M3) and sparse (BM25) vectors. Query path uses Reciprocal Rank Fusion (RRF) natively in Qdrant 1.18+. Captures exact-term matches (codes, names, acronyms) that dense embeddings tend to bury.
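To make the hybrid-search item concrete, here is a small Python sketch of Reciprocal Rank Fusion itself. In the deployment described above the fusion runs inside Qdrant, so this only illustrates the scoring rule, where each document receives the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly quoted for RRF and is an assumption here, as are the document ids.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["d1", "d2", "d3"]    # hypothetical dense (embedding) ranking
sparse = ["d3", "d1", "d4"]   # hypothetical sparse (BM25) ranking
fused = rrf_fuse([dense, sparse])
print(fused)
```

Documents that appear in both lists (d1, d3) rise above those found by only one retriever, which is how an exact-term hit from the sparse side can surface even when the dense ranking places it low.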
Architecture diagram and tunable knobs documented in Concepts → RAG.
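The oversample-then-rerank step from this release can be sketched as a small Python function. The two callables are hypothetical stand-ins: in the real pipeline `search` would query Qdrant and `score_pairs` would call the reranker endpoint, neither of which is shown here. The defaults of 30 candidates and 5 results mirror the numbers given above.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_and_rerank(
    query: str,
    search: Callable[[str, int], List[str]],
    score_pairs: Callable[[str, Sequence[str]], List[float]],
    oversample: int = 30,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Fetch `oversample` candidates, re-score them, keep the best `top_k`."""
    candidates = search(query, oversample)
    if not candidates:
        return []
    scores = score_pairs(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


# Toy stand-ins so the sketch runs without any service.
corpus = [
    "reset the router password",
    "router firmware update steps",
    "office lunch menu",
    "password policy for admins",
]


def toy_search(query: str, limit: int) -> List[str]:
    return corpus[:limit]


def toy_score(query: str, docs: Sequence[str]) -> List[float]:
    # Word overlap as a crude proxy for a cross-encoder relevance score.
    q = set(query.split())
    return [float(len(q & set(d.split()))) for d in docs]


top = retrieve_and_rerank("router password", toy_search, toy_score, top_k=2)
print(top)
```

The first stage can stay cheap and recall-oriented because the second stage only has to order a few dozen candidates.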
Cleanup: all pre-upgrade KBs were wiped (Qdrant collections + Postgres rows + filesystem) since the old schema is not hybrid-compatible.
2026-05-19 — RAG goes live + fleet expansion + docs rebuild
RAG (Retrieval-Augmented Generation)
- End-to-end RAG service launched in dashboard and API.
- Vector store: Qdrant on Blackwell host, persistent on dedicated NVMe.
- Embeddings: BGE-M3 (multilingual, 1024-dim) served on dedicated L40S GPU.
- LLM: Apertus 70B by default; any catalog model selectable per request.
- UI: /dashboard/rag for non-developers — drag-and-drop PDF/DOCX/MD/TXT, ask questions, cited answers.
- API: POST /api/v1/rag/kb, POST /kb/{slug}/docs, POST /kb/{slug}/chat.
- Supported formats: PDF (via pdftotext), DOCX, Markdown, plain text.
- End-to-end latency: ~6 s for a 5-chunk retrieval + 100-token answer on Apertus.
Fleet
- L40A — second L40S online for embeddings (BGE-M3 hosting). 100 GbE.
- L40B — first Apertus serving node (vLLM, W4A16 quant). 145 tok/s @ conc=8.
- DGX Spark — Grace+Blackwell ARM, second Apertus serving node. 49 tok/s @ conc=8, ~3× lower TDP than L40S — best for slow tier.
- BigGuy — 4× RTX 6000 Pro Blackwell, downloading DeepSeek-V3-AWQ (327 GB).
- Mac mini × 2 — Ollama serving Qwen 2.5 1.5b + 7b. Used for slow tier and title generation.
- PTP 100 GbE link between two GPU hosts benched at 58 Gbps TCP, MTU 9000, 0.2 ms latency — ready for cross-host pipeline-parallel.
- MikroTik ROSE Data Server RDS2216 as gateway (ARM64 16 cores, 32 GB RAM, 2×100 GbE + 6×10 GbE).
- Juniper QFX5200-32C-32Q as core L2 switch (32×100 GbE + 32×40 GbE).
Backend infrastructure
- backends + backend_models tables added: pool-based fleet management with per-tier weights.
- BackendRouter service routes requests by (tier, model) using weighted ranking + queue depth tiebreaker.
- VllmBackend + HttpBackend (Ollama): two client implementations, swappable per backend kind.
- InferenceHealthCheck cron polls every 30 s and updates is_healthy, queue_depth, gpu_pressure.
- Filament admin at /admin/backends for live fleet management without redeploy.
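One plausible reading of the routing rule above is sketched below in Python: keep healthy backends that serve the model and have a positive weight for the tier, prefer the highest weight, and break ties on the shortest queue. The `Backend` dataclass, `pick_backend`, the model ids, and the weights are hypothetical; the actual BackendRouter may rank differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Backend:
    name: str
    models: Set[str]
    tier_weights: Dict[str, int]  # per-tier weight; 0 or missing = not in tier
    is_healthy: bool = True
    queue_depth: int = 0


def pick_backend(backends: List[Backend], tier: str, model: str) -> Optional[Backend]:
    """Pick the highest-weight healthy backend; shortest queue breaks ties."""
    eligible = [
        b for b in backends
        if b.is_healthy and model in b.models and b.tier_weights.get(tier, 0) > 0
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda b: (-b.tier_weights[tier], b.queue_depth))


# Hypothetical fleet snapshot, as a health check might have left it.
fleet = [
    Backend("l40b", {"apertus-70b"}, {"fast": 10, "medium": 5}, queue_depth=3),
    Backend("dgx-spark", {"apertus-70b"}, {"slow": 10, "medium": 5}, queue_depth=1),
    Backend("mac-mini-1", {"qwen2.5-7b"}, {"slow": 10}, is_healthy=False),
]

print(pick_backend(fleet, "medium", "apertus-70b").name)  # equal weight, shorter queue wins
print(pick_backend(fleet, "fast", "apertus-70b").name)
print(pick_backend(fleet, "slow", "qwen2.5-7b"))          # only candidate is unhealthy
```

Keeping the weights in data rather than code matches the stated goal of managing the fleet from the admin panel without a redeploy.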
Public site & wiki
- CDA-driven copy rewrites: hero, "Provalo, smetti di subire il CLOUD Act" section, "Due modi di lavorare", trust strip, removed several stale sections.
- Pricing UI: 4-card plan grid with toggle Monthly/Yearly, struck-through original on annual selection.
- Top-up selector: pill of 5 amounts (5/25/50/100/500 CHF), CHF → tokens estimate at slow tier, dynamic Acquista token ("Buy tokens") CTA.
- This wiki: rebuilt from scratch — multi-page, markdown-backed, Cmd+K search, live data pages, copy buttons on code blocks, anchor headings, GH-style callouts.
Pricing model
- Removed deprecated embeddings tier (only 4 tiers: slow/medium/fast/ludicrous).
- Removed surcharge_pct from tier_settings (latent bug — was multiplying costs ×2; fixed and gone).
- Split Plan.price_chf_month + Plan.stripe_price_id into Product table; products is now the single source of truth for prices.
- Added is_highlighted flag on Plan (was hardcoded to pro in templates).
- Added is_visible_on_home flag on ModelCard (home shows curated set, /models shows full).
2026-05-17 — Production platform live
- siati.ai public site live at https://siati.ai.
- Dashboard at https://my.siati.ai (registration, billing, playground, API keys).
- Wiki at https://wiki.siati.ai (single-page initially — now superseded).
- Mobile API contract /api/v1/* implemented (auth, /me/boot, chat sessions, IAP verify, billing v2).
/api/v1/*implemented (auth, /me/boot, chat sessions, IAP verify, billing v2). - Stripe integration (subscription checkout + topup), Apple/Google IAP verification endpoint.
- Cookie consent + Google Analytics gated by consent.
- Multi-domain Laravel app: siati.ai, my.siati.ai, api.siati.ai, wiki.siati.ai, chat.siati.ai.
2026-05-15 — Backend rebuild started
- Started from scratch on Laravel 13 + Postgres 16 + Redis + Filament 4.
- Migrated user accounts + API keys from legacy backend (gigia_) preserving Argon2id hashes.
- Implemented BearerApiKey middleware (HMAC-SHA256 hashing with salt).
- JWT auth for mobile API (firebase/php-jwt, 30-day TTL).
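The API-key hashing idea can be sketched in Python with the standard library. This assumes the "salt" is a single server-side secret used as the HMAC key, so that a presented key can be hashed and looked up directly; the function names and the `sk_` prefix are hypothetical, and the real BearerApiKey middleware is PHP and may differ.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def hash_api_key(api_key: str, server_salt: bytes) -> str:
    """HMAC-SHA256 of the key, with the server-side salt as the HMAC key."""
    return hmac.new(server_salt, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_api_key(presented: str, stored_hash: str, server_salt: bytes) -> bool:
    """Constant-time comparison of the presented key's hash with the stored one."""
    return hmac.compare_digest(hash_api_key(presented, server_salt), stored_hash)


def extract_bearer(authorization_header: str) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


# Demo with throwaway values; only the hash would be stored in the database.
salt = secrets.token_bytes(32)
api_key = "sk_" + secrets.token_urlsafe(24)   # prefix is hypothetical
stored = hash_api_key(api_key, salt)

token = extract_bearer(f"Bearer {api_key}")
print(token is not None and verify_api_key(token, stored, salt))
```

Storing only the keyed hash means a leaked database does not expose usable API keys unless the salt leaks as well.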
Future entries will continue here. Anything material to the API or pricing is logged.