Tiers

How priority, throughput and price are tied together with one parameter.

Last updated: 2026-05-19

A tier controls three things at once:

  1. Which hardware serves your request (BackendRouter weight)
  2. Where you sit in the queue (vLLM priority scheduler)
  3. How much you pay per token

There are four tiers. Set a default tier on the API key (at creation, or later in the dashboard), or override it per request with the X-Siati-Tier header.
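The precedence between the two settings can be sketched in a few lines of Python. This is only an illustration of "header overrides key default"; the function name `resolve_tier`, the `key_default` argument, and the choice to reject unknown tier names with an error are assumptions, not documented server behaviour.

```python
VALID_TIERS = ("slow", "medium", "fast", "ludicrous")


def resolve_tier(headers: dict, key_default: str) -> str:
    """Return the effective tier: the X-Siati-Tier header if present, else the key default."""
    # HTTP header names are case-insensitive, so normalise them before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    tier = normalised.get("x-siati-tier", key_default).strip()
    # Assumption: unknown tier names are rejected rather than silently downgraded.
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return tier
```

For example, `resolve_tier({}, "medium")` falls back to the key default, while `resolve_tier({"X-Siati-Tier": "ludicrous"}, "medium")` returns the override.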

The four tiers

| Tier | Typical use | Hardware preference | Queue priority |
| --- | --- | --- | --- |
| slow | Background jobs, batch summaries, indexing | Apple Silicon, ARM compute, idle GPU slots | 1 |
| medium | Standard chat, dev, prototype | L40S, GB10, mid-range | 10 |
| fast | Production user-facing requests | L40S, RTX 6000 Pro | 100 |
| ludicrous | Real-time latency-sensitive, top quality | RTX 6000 Pro Blackwell × 4, reserved slots | 1000 |

Live prices: see Pricing.

How routing works

Every backend has one weight per tier (tier_weights), set in the admin UI. For a request (tier, model), the router:

  1. Filters backends to those that (a) serve model, (b) are healthy, (c) have tier_weights[tier] > 0.
  2. Sorts by tier_weights[tier] DESC, then queue_depth ASC, then gpu_pressure ASC.
  3. Sends the request to the winner, with priority = priority_for(tier) in the vLLM body.

Example: a ludicrous request for apertus-70b-instruct lands on the 4× Blackwell node because its tier_weights[ludicrous] is the highest. A slow request for the same model goes to Spark (lower TDP, cheaper to operate, but slower per stream).
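The filter-then-sort steps above can be sketched as follows. This is a minimal illustration, not the actual BackendRouter: the `Backend` dataclass, its field types, and `select_backend` are hypothetical names, and `priority_for` simply reuses the queue-priority values from the tier table. How those values are translated into the vLLM `priority` field is not specified here; vLLM's own documentation describes lower priority values as handled earlier, so the real mapping may differ.

```python
from dataclasses import dataclass

# Queue priorities copied from the tier table (assumed to be what priority_for returns).
TIER_PRIORITY = {"slow": 1, "medium": 10, "fast": 100, "ludicrous": 1000}


@dataclass
class Backend:
    # Hypothetical shape of a backend record; field names follow the prose above.
    name: str
    models: set
    healthy: bool
    tier_weights: dict
    queue_depth: int = 0
    gpu_pressure: float = 0.0


def priority_for(tier: str) -> int:
    return TIER_PRIORITY[tier]


def select_backend(backends, tier, model):
    """Return the winning backend for (tier, model), or None if none qualifies."""
    # Step 1: keep backends that serve the model, are healthy, and have a positive weight.
    candidates = [
        b for b in backends
        if model in b.models and b.healthy and b.tier_weights.get(tier, 0) > 0
    ]
    if not candidates:
        return None
    # Step 2: weight DESC, then queue_depth ASC, then gpu_pressure ASC.
    return min(
        candidates,
        key=lambda b: (-b.tier_weights[tier], b.queue_depth, b.gpu_pressure),
    )
```

Step 3 would then forward the request to the returned backend with `priority_for(tier)` attached. Negating the weight inside a single sort key keeps the three-level ordering in one pass; ties on weight fall through to queue depth and then GPU pressure.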

What you actually see

  • Lower tier → higher TTFT and lower tok/s/req when there's contention. Without contention, tiers feel similar.
  • Higher tier → guaranteed slot under load. Useful for user-facing apps where p95 matters.
  • Rate limits are per tier (see Pricing for req/min).

Setting the tier

Per API key (default)

Dashboard → API keys → edit → Default tier.

Per request (override)

```bash
curl https://api.siati.ai/v1/chat/completions \
  -H "Authorization: Bearer $SIATI_API_KEY" \
  -H "X-Siati-Tier: ludicrous" \
  -H "Content-Type: application/json" \
  -d '{ "model": "apertus-70b-instruct", "messages": [...] }'
```
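The same override can be sent from Python using only the standard library. The sketch below builds the request without sending it; the helper name `build_chat_request` is hypothetical, and the OpenAI-style `messages` shape is an assumption based on the endpoint path rather than something stated on this page.

```python
import json
import urllib.request

API_URL = "https://api.siati.ai/v1/chat/completions"
VALID_TIERS = ("slow", "medium", "fast", "ludicrous")


def build_chat_request(api_key, model, messages, tier=None):
    """Build a POST request; when tier is None the key's default tier applies."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if tier is not None:
        if tier not in VALID_TIERS:
            raise ValueError(f"unknown tier: {tier!r}")
        # Per-request override of the key's default tier.
        headers["X-Siati-Tier"] = tier
    # Assumption: the body follows the usual chat-completions shape.
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")
```

To send it, pass the result to `urllib.request.urlopen` and decode the JSON response. Leaving `tier` unset omits the header entirely, so the default configured on the API key takes effect.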

Why not just one tier?

Because our hardware is heterogeneous on purpose. A slow request on a Mac mini draws 30 W on our side; a ludicrous request on a Blackwell node draws 1.2 kW, 40 times as much. Pricing reflects that. You shouldn't pay Blackwell rates for a background summarisation job.

Migrating from OpenAI

OpenAI has tiers too, but they're spend-based access levels, not request-level priority. The equivalence is approximate:

| OpenAI | Closest siati |
| --- | --- |
| Tier 1-2 (default) | slow / medium |
| Tier 3-4 (higher spend) | fast |
| Tier 5 (enterprise) | ludicrous |
| OpenAI "priority service" | Built into ludicrous |