Concepts

Tiers

How priority, throughput and price are tied together with one parameter.

Last updated: 2026-05-19

Tiers

A tier controls three things at once:

Which hardware serves your request (BackendRouter weight)
Where you sit in the queue (vLLM priority scheduler)
How much you pay per token

There are four tiers. You choose either at API-key creation (default tier) or per-request via the X-Siati-Tier header.

The four tiers

Tier	Typical use	Hardware preference	Queue priority
`slow`	Background jobs, batch summaries, indexing	Apple Silicon, ARM compute, idle GPU slots	1
`medium`	Standard chat, dev, prototype	L40S, GB10, mid-range	10
`fast`	Production user-facing requests	L40S, RTX 6000 Pro	100
`ludicrous`	Real-time latency-sensitive, top quality	RTX 6000 Pro Blackwell × 4, reserved slots	1000

Live prices: see Pricing.

How routing works

Every backend has a vector of weights per tier, set in the admin UI. For a request (tier, model), the router:

Filters backends to those that (a) serve model, (b) are healthy, (c) have tier_weights[tier] > 0.
Sorts by tier_weights[tier] DESC, then queue_depth ASC, then gpu_pressure ASC.
Sends the request to the winner, with priority = priority_for(tier) in the vLLM body.

Example: a ludicrous request for apertus-70b-instruct lands on the 4× Blackwell node because its tier_weights[ludicrous] is the highest. A slow request for the same model goes to Spark (lower TDP, cheaper to operate, but slower per stream).

What you actually see

Lower tier → higher TTFT and lower tok/s/req when there's contention. Without contention, tiers feel similar.
Higher tier → guaranteed slot under load. Useful for user-facing apps where p95 matters.
Rate limits are per tier (see Pricing for req/min).

Setting the tier

Per API key (default)

Dashboard → API keys → edit → Default tier.

Per request (override)

curl https://api.siati.ai/v1/chat/completions \
  -H "Authorization: Bearer $SIATI_API_KEY" \
  -H "X-Siati-Tier: ludicrous" \
  -H "Content-Type: application/json" \
  -d '{ "model": "apertus-70b-instruct", "messages": [...] }'

Why not just one tier?

Because our HW is heterogeneous on purpose. A slow request on a Mac mini costs us 30 W; a ludicrous request on a Blackwell node costs us 1.2 kW. Pricing reflects that. You shouldn't pay Blackwell rates for a background summarisation job.

Migrating from OpenAI

OpenAI has tiers too, but they're spend-based access levels, not request-level priority. The equivalence is approximate:

OpenAI	Closest siati
`Tier 1-2` (default)	`slow / medium`
`Tier 3-4` (paid more)	`fast`
`Tier 5` (enterprise)	`ludicrous`
OpenAI "priority service"	Built-in to `ludicrous`

Tiers

The four tiers#

How routing works#

What you actually see#

Setting the tier#

Per API key (default)#

Per request (override)#

Why not just one tier?#

Migrating from OpenAI#

The four tiers

How routing works

What you actually see

Setting the tier

Per API key (default)

Per request (override)

Why not just one tier?

Migrating from OpenAI