Rate limits

Per-tier RPM, response headers, retry strategy.

Last updated: 2026-05-19

Limits apply per API key and depend on the key's tier. The free tier is the most restrictive; higher tiers raise the limit in line with the user's spend.

Current limits (per minute, per API key)

Tier	RPM
`slow`	60
`medium`	120
`fast`	240
`ludicrous`	1000

Live values: see Pricing.
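As a rough illustration of what these numbers mean for a client, the sketch below converts each tier's RPM into an even spacing between requests. The `TIER_RPM` dictionary and `min_interval_seconds` helper are hypothetical names introduced here, and the values are copied from the table rather than fetched live.

```python
# RPM values copied from the table above; live values may differ.
TIER_RPM = {"slow": 60, "medium": 120, "fast": 240, "ludicrous": 1000}


def min_interval_seconds(tier: str) -> float:
    """Even spacing between requests that roughly matches the tier's RPM."""
    return 60.0 / TIER_RPM[tier]


for tier in TIER_RPM:
    print(f"{tier}: one request every {min_interval_seconds(tier):.3f}s")
```

Pacing requests at this interval is only a client-side courtesy; the server still decides when to return 429.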

How rate limits work

We use a sliding 1-minute window backed by Redis. Each request increments a counter; if the count exceeds the tier's RPM, we return a 429.
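To make the idea concrete, here is a minimal in-memory sketch of a sliding-window limiter. It is not the production implementation: it keeps timestamps in a Python deque instead of Redis, counts only accepted requests (an assumption), and the `SlidingWindowLimiter` name and injectable `clock` parameter are hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any `window`-second span (in-memory sketch)."""

    def __init__(self, limit: int, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.hits = deque()  # timestamps of accepted requests, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self.hits and now - self.hits[0] >= self.window:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False  # a server would answer with HTTP 429 here
        self.hits.append(now)
        return True
```

A `slow`-tier key would correspond to `SlidingWindowLimiter(limit=60)`; each call to `allow()` then stands in for one incoming request.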

429 response

HTTP/1.1 429 Too Many Requests
Retry-After: 18
Content-Type: application/json

{
  "error": {
    "message": "rate limit exceeded (60/min)",
    "type": "rate_limit_exceeded"
  }
}

The Retry-After header is given in seconds and is the minimum safe time to wait before retrying. The window resets on the next minute boundary.
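A client might pull the wait time and the error message out of such a response as sketched below. The function name `parse_rate_limit_response` and its (status, headers, body) signature are hypothetical; adapt them to whatever your HTTP library returns.

```python
import json


def parse_rate_limit_response(status: int, headers: dict, body: str):
    """Return (retry_after_seconds, message) for a 429, or None for other statuses."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before looking up.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        retry_after = float(lowered.get("retry-after", ""))
    except ValueError:
        retry_after = None  # header missing or not a plain number of seconds
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        message = None  # body was not the JSON error shape shown above
    return retry_after, message
```

With the example response above, this would yield a wait of 18 seconds and the message "rate limit exceeded (60/min)".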

Recommended retry strategy

Exponential backoff with jitter, capped at 60s:

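import random
import time

class RateLimitError(Exception):
    """Stand-in for your client's 429 error; retry_after is in seconds."""
    retry_after = None

def with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                break  # last attempt failed; give up without sleeping
            # Exponential backoff plus jitter, capped at 60s.
            wait = min(60, (2 ** attempt) + random.uniform(0, 0.5))
            # Wait at least as long as the server's Retry-After.
            if e.retry_after:
                wait = max(wait, e.retry_after)
            time.sleep(wait)
    raise RuntimeError("max retries exceeded")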

The official OpenAI Python SDK retries 429 responses automatically, and it works as-is against siati.

Per-organisation limits

For enterprise customers we can set an aggregate org-wide RPM and split it between keys. Talk to us if your scale needs that.

Concurrent requests

Concurrency is limited independently of RPM. Each backend has a max_num_seqs (the vLLM concurrency limit), and the BackendRouter picks a backend with spare slots. If all backends serving your model are saturated, the request is queued, with priority based on tier.
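The routing and queueing behaviour described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual BackendRouter: the `Backend` and `Router` classes, the "most spare slots" selection rule, and the `TIER_PRIORITY` ordering (higher tiers served first) are all hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass

# Assumed priority order (lower number is served first); not taken from the docs.
TIER_PRIORITY = {"ludicrous": 0, "fast": 1, "medium": 2, "slow": 3}


@dataclass
class Backend:
    name: str
    max_num_seqs: int  # concurrency limit of this backend
    running: int = 0   # requests currently in flight

    @property
    def spare(self) -> int:
        return self.max_num_seqs - self.running


class Router:
    def __init__(self, backends):
        self.backends = backends
        self.queue = []  # heap of (priority, arrival order, request id)
        self._order = itertools.count()

    def submit(self, request_id: str, tier: str):
        """Return the backend name that took the request, or None if it was queued."""
        best = max(self.backends, key=lambda b: b.spare)
        if best.spare > 0:
            best.running += 1
            return best.name
        heapq.heappush(self.queue, (TIER_PRIORITY[tier], next(self._order), request_id))
        return None

    def finish(self, backend_name: str):
        """Free a slot; hand it to the highest-priority queued request, if any."""
        backend = next(b for b in self.backends if b.name == backend_name)
        backend.running -= 1
        if self.queue:
            _, _, request_id = heapq.heappop(self.queue)
            backend.running += 1
            return request_id
        return None
```

The arrival counter in the heap entry keeps requests of the same tier in first-come, first-served order.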

The queue depth is exposed in Infrastructure (fleet) for transparency.
