Tiers
How priority, throughput and price are tied together with one parameter.
Last updated: 2026-05-19
A tier controls three things at once:
- Which hardware serves your request (BackendRouter weight)
- Where you sit in the queue (vLLM priority scheduler)
- How much you pay per token
There are four tiers. You choose one either as the key's default tier, set at API-key creation, or per request via the X-Siati-Tier header.
The four tiers
| Tier | Typical use | Hardware preference | Queue priority |
|---|---|---|---|
| slow | Background jobs, batch summaries, indexing | Apple Silicon, ARM compute, idle GPU slots | 1 |
| medium | Standard chat, dev, prototype | L40S, GB10, mid-range | 10 |
| fast | Production user-facing requests | L40S, RTX 6000 Pro | 100 |
| ludicrous | Real-time latency-sensitive, top quality | RTX 6000 Pro Blackwell × 4, reserved slots | 1000 |
Live prices: see Pricing.
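The choice between the key's default tier and the per-request header can be sketched as below. This is a minimal illustration, not the service's actual code: the function name `effective_tier` and the decision to reject unknown tier names with an error are assumptions. Only the four tier names and the X-Siati-Tier header come from this page.

```python
from typing import Mapping

# The four tier names documented on this page.
TIERS = ("slow", "medium", "fast", "ludicrous")


def effective_tier(key_default: str, headers: Mapping[str, str]) -> str:
    """Return the header override if present, otherwise the key's default tier."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    tier = lowered.get("x-siati-tier", key_default).strip()
    if tier not in TIERS:
        # Assumption: an unknown tier is rejected rather than silently downgraded.
        raise ValueError(f"unknown tier: {tier!r}")
    return tier
```

With a key whose default is medium, a request without the header resolves to medium, and a request carrying `X-Siati-Tier: ludicrous` resolves to ludicrous.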
How routing works
Every backend has a vector of weights per tier, set in the admin UI. For a request (tier, model), the router:
- Filters backends to those that (a) serve model, (b) are healthy, (c) have tier_weights[tier] > 0.
- Sorts by tier_weights[tier] DESC, then queue_depth ASC, then gpu_pressure ASC.
- Sends the request to the winner, with priority = priority_for(tier) in the vLLM body.
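The filter, sort, and dispatch steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the router's real implementation: the `Backend` dataclass, the `route` function, the backend names, and the example weights are all hypothetical. The field names `tier_weights`, `queue_depth`, and `gpu_pressure`, the sort order, and the priority values are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Queue priorities per tier, as listed in the tier table.
PRIORITY = {"slow": 1, "medium": 10, "fast": 100, "ludicrous": 1000}


def priority_for(tier: str) -> int:
    return PRIORITY[tier]


@dataclass
class Backend:  # hypothetical shape, for illustration only
    name: str
    models: set[str]
    healthy: bool
    tier_weights: dict[str, float]
    queue_depth: int = 0
    gpu_pressure: float = 0.0


def route(backends: list[Backend], tier: str, model: str) -> Optional[tuple[Backend, int]]:
    """Pick a backend for (tier, model) and the priority to send with the request."""
    # Step 1: keep backends that serve the model, are healthy, and have weight > 0.
    candidates = [
        b for b in backends
        if model in b.models and b.healthy and b.tier_weights.get(tier, 0) > 0
    ]
    if not candidates:
        return None  # assumption: the caller decides how to handle "no backend"
    # Step 2: weight DESC, then queue_depth ASC, then gpu_pressure ASC.
    winner = min(
        candidates,
        key=lambda b: (-b.tier_weights[tier], b.queue_depth, b.gpu_pressure),
    )
    # Step 3: the request goes to the winner with the tier's priority.
    return winner, priority_for(tier)


# Hypothetical fleet with made-up weights.
BACKENDS = [
    Backend("blackwell-4x", {"apertus-70b-instruct"}, True,
            {"ludicrous": 10, "fast": 5, "slow": 0}),
    Backend("spark", {"apertus-70b-instruct"}, True,
            {"ludicrous": 1, "medium": 3, "slow": 10}),
]
```

With these example weights, `route(BACKENDS, "ludicrous", "apertus-70b-instruct")` selects blackwell-4x with priority 1000, while the same call with "slow" selects spark with priority 1, because blackwell-4x has a weight of 0 for that tier and is filtered out.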
Example: a ludicrous request for apertus-70b-instruct lands on the 4× Blackwell node because its tier_weights[ludicrous] is the highest. A slow request for the same model goes to Spark (lower TDP, cheaper to operate, but slower per stream).
What you actually see
- Lower tier → higher TTFT and lower tok/s/req when there's contention. Without contention, tiers feel similar.
- Higher tier → guaranteed slot under load. Useful for user-facing apps where p95 matters.
- Rate limits are per tier (see Pricing for req/min).
Setting the tier
Per API key (default)
Dashboard → API keys → edit → Default tier.
Per request (override)
curl https://api.siati.ai/v1/chat/completions \
-H "Authorization: Bearer $SIATI_API_KEY" \
-H "X-Siati-Tier: ludicrous" \
-H "Content-Type: application/json" \
-d '{ "model": "apertus-70b-instruct", "messages": [...] }'
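The same override can be sent from Python with only the standard library. The sketch below mirrors the curl call; the helper names `build_request` and `send` are hypothetical, and the message shape (a list of role/content objects) is an assumption, since the curl example elides the messages. The response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request

API_URL = "https://api.siati.ai/v1/chat/completions"


def build_request(tier: str, model: str, messages: list, api_key: str) -> urllib.request.Request:
    """Build a chat completion request that overrides the key's default tier."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Siati-Tier": tier,  # per-request tier override
            "Content-Type": "application/json",
        },
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

A call such as `send(build_request("ludicrous", "apertus-70b-instruct", messages, os.environ["SIATI_API_KEY"]))` would then issue the request, with `messages` supplied by your application.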
Why not just one tier?
Because our hardware is heterogeneous on purpose. A slow request on a Mac mini draws about 30 W on our side; a ludicrous request on a Blackwell node draws about 1.2 kW, roughly 40 times as much. Pricing reflects that. You shouldn't pay Blackwell rates for a background summarisation job.
Migrating from OpenAI
OpenAI has tiers too, but they're spend-based access levels, not request-level priority. The equivalence is approximate:
| OpenAI | Closest siati |
|---|---|
| Tier 1-2 (default) | slow / medium |
| Tier 3-4 (paid more) | fast |
| Tier 5 (enterprise) | ludicrous |
| OpenAI "priority service" | Built-in to ludicrous |