Rate limits
Per-tier RPM, response headers, retry strategy.
Last updated: 2026-05-19
Limits apply per API key and depend on the key's tier. The free tier is the most restrictive; higher tiers scale with the user's spend.
Current limits (per minute, per API key)
| Tier | RPM |
|---|---|
| slow | 60 |
| medium | 120 |
| fast | 240 |
| ludicrous | 1000 |
Live values: see Pricing.
How rate limits work
We use a sliding 1-minute window backed by Redis. Each request increments a counter for its API key; if the counter exceeds the tier's RPM, the request is rejected with HTTP 429.
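To make the windowing behaviour concrete, here is a minimal in-memory sketch of a sliding-window limiter. It is an illustration only, not the production implementation: the real limiter is backed by Redis, and the class name `SlidingWindowLimiter`, the injectable `clock`, and the choice not to count rejected requests are all assumptions made for this example.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """In-memory sliding-window log: at most `rpm` requests in any window."""

    def __init__(self, rpm, window_seconds=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = {}      # api_key -> deque of accepted request timestamps

    def allow(self, api_key):
        """Return (allowed, retry_after_seconds) for one incoming request."""
        now = self.clock()
        q = self.hits.setdefault(api_key, deque())
        # Drop timestamps that have slid out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.rpm:
            # The oldest accepted request leaves the window after this many seconds.
            return False, self.window - (now - q[0])
        q.append(now)
        return True, 0.0
```

In this sketch the second element of the returned tuple plays the role of the Retry-After value: it is the time until the oldest counted request falls out of the window.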
429 response
HTTP/1.1 429 Too Many Requests
Retry-After: 18
Content-Type: application/json

{
  "error": {
    "message": "rate limit exceeded (60/min)",
    "type": "rate_limit_exceeded"
  }
}
Retry-After is in seconds and is the safe time to wait before retrying. The window resets on the next minute boundary.
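A client can read both the wait time and the error message from a response shaped like the one above. The following is a small sketch using only the standard library; the function name `parse_rate_limit` and its `(status, headers, body)` signature are hypothetical, chosen so the example does not depend on any particular HTTP client.

```python
import json


def parse_rate_limit(status, headers, body):
    """Return (retry_after_seconds, message) for a 429, or None otherwise."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        retry_after = float(lowered.get("retry-after", ""))
    except ValueError:
        retry_after = None  # header missing or not a plain number of seconds
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        message = None  # body is not JSON or lacks the documented error shape
    return retry_after, message
```

Returning `None` for a missing or malformed header lets the caller fall back to its own backoff schedule instead of failing.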
Recommended retry strategy
Exponential backoff with jitter, capped at 60s:
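import random
import time

class RateLimitError(Exception):  # placeholder: use your client's 429 exception
    retry_after = None  # seconds from the Retry-After header, if known

def with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                break  # out of attempts; don't sleep before giving up
            # Exponential backoff with jitter, capped at 60s
            wait = min(60, (2 ** attempt) + random.uniform(0, 0.5))
            if e.retry_after:
                wait = max(wait, e.retry_after)  # honour the server's Retry-After
            time.sleep(wait)
    raise RuntimeError("max retries exceeded")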
The official OpenAI Python SDK has built-in retry on 429; it works as-is against siati.
Per-organisation limits
For enterprise customers we can set an aggregate org-wide RPM and split it between keys. Talk to us if your scale needs that.
Concurrent requests
Concurrency is limited independently of RPM. Each backend has a max_num_seqs (the vLLM concurrency limit), and the BackendRouter picks a backend with spare slots. If all backends serving your model are saturated, the request queues, with priority based on tier.
The queue depth is exposed in Infrastructure (fleet) for transparency.
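The routing and queueing behaviour described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual BackendRouter: the `Backend` and `Router` classes, the "most spare slots first" selection rule, and the `TIER_PRIORITY` ordering (higher tiers served first) are all hypothetical, since the source only says that queue priority is based on tier.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority order: a lower number is served first.
TIER_PRIORITY = {"ludicrous": 0, "fast": 1, "medium": 2, "slow": 3}


@dataclass
class Backend:
    name: str
    max_num_seqs: int   # concurrency limit for this backend
    running: int = 0    # sequences currently in flight

    def spare(self):
        return self.max_num_seqs - self.running


@dataclass
class Router:
    backends: list
    _queue: list = field(default_factory=list)
    _seq: itertools.count = field(default_factory=itertools.count)

    def submit(self, request_id, tier):
        """Start the request on a backend with spare slots, else queue it."""
        best = max(self.backends, key=Backend.spare)
        if best.spare() > 0:
            best.running += 1
            return best.name
        # The counter keeps FIFO order among requests of the same tier.
        heapq.heappush(self._queue, (TIER_PRIORITY[tier], next(self._seq), request_id))
        return None

    def finish(self, backend_name):
        """Free a slot and hand it to the highest-priority queued request, if any."""
        backend = next(b for b in self.backends if b.name == backend_name)
        backend.running -= 1
        if self._queue:
            _, _, request_id = heapq.heappop(self._queue)
            backend.running += 1
            return request_id
        return None

    def queue_depth(self):
        return len(self._queue)
```

In this model `submit` returns the backend name when the request starts immediately and `None` when it is queued, and `queue_depth` corresponds to the figure a fleet page would report.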