Rate limits

Per-tier RPM, response headers, retry strategy.

Last updated: 2026-05-19

Limits apply per API key and depend on the key's tier. The free tier is the most restrictive; higher tiers raise the limit in line with the user's spend.

Current limits (per minute, per API key)

Tier        RPM
slow        60
medium      120
fast        240
ludicrous   1000

Live values: see Pricing.
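
To stay under a tier's limit proactively, a client can space its requests evenly. The sketch below copies the table above into a dictionary and derives the minimum gap between requests. `TIER_RPM` and `min_interval_seconds` are illustrative names, not part of the siati API, and the live values on Pricing take precedence over the numbers hard-coded here.

```python
# Values copied from the table above; live values may differ (see Pricing).
TIER_RPM = {"slow": 60, "medium": 120, "fast": 240, "ludicrous": 1000}


def min_interval_seconds(tier: str) -> float:
    """Smallest evenly spaced gap between requests that stays within the tier's RPM."""
    return 60.0 / TIER_RPM[tier]


if __name__ == "__main__":
    for tier in TIER_RPM:
        print(f"{tier}: one request every {min_interval_seconds(tier):.3f}s")
```

In practice it is probably worth pacing slightly slower than this, so that clock skew or bursts from other processes sharing the key do not tip you over the limit.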

How rate limits work

We use a sliding 1-minute window backed by Redis. Each request increments a per-key counter; if the count exceeds the tier's RPM, we return 429.
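
As a rough illustration of the sliding-window idea, here is a minimal in-memory sketch. It is not the production implementation: the real limiter is backed by Redis, and the class name `SlidingWindowLimiter`, its parameters, and the choice not to count rejected requests are all assumptions made for the example.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """In-memory sliding-window limiter: at most `rpm` requests per rolling window."""

    def __init__(self, rpm: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rpm = rpm
        self.window_s = window_s
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = {}

    def allow(self, api_key: str) -> bool:
        """Record the request and return True, or return False (a 429) if over the limit."""
        now = self.clock()
        hits = self._hits.setdefault(api_key, deque())
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.rpm:
            return False
        hits.append(now)
        return True
```

Each API key gets its own queue of timestamps, so one key hitting its limit does not affect another.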

429 response

http
HTTP/1.1 429 Too Many Requests
Retry-After: 18
Content-Type: application/json

{
  "error": {
    "message": "rate limit exceeded (60/min)",
    "type": "rate_limit_exceeded"
  }
}

Retry-After is given in seconds; waiting at least that long before retrying is safe. The window resets on the next minute boundary.
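
A client that does not use an SDK can pull the wait time and the error message out of the response itself. The sketch below assumes the 429 shape shown above; `parse_rate_limit` is a hypothetical helper, and it takes the status, headers, and body as plain values so it does not depend on any particular HTTP library.

```python
import json
from typing import Mapping, Optional


def parse_rate_limit(status: int, headers: Mapping[str, str], body: str) -> Optional[dict]:
    """Return {'retry_after': seconds or None, 'message': str} for a 429, else None."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("retry-after")
    try:
        retry_after = float(raw) if raw is not None else None
    except ValueError:
        retry_after = None  # not a plain number of seconds
    try:
        message = json.loads(body)["error"]["message"]
    except (ValueError, KeyError, TypeError):
        message = ""  # body missing, not JSON, or not in the expected shape
    return {"retry_after": retry_after, "message": message}
```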

Recommended retry strategy

Use exponential backoff with jitter, capped at 60s, and wait at least Retry-After when the server sends it:

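python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your client's 429 error; retry_after is seconds or None."""
    def __init__(self, message="rate limit exceeded", retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after

def with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                # Out of attempts: fail now instead of sleeping first.
                raise RuntimeError("max retries exceeded") from e
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter, capped at 60s.
            wait = min(60, (2 ** attempt) + random.uniform(0, 0.5))
            if e.retry_after:
                wait = max(wait, e.retry_after)  # never retry before Retry-After
            time.sleep(wait)
    raise RuntimeError("max retries exceeded")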
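
To make the schedule concrete, the sketch below computes the wait between attempts with the jitter term left out, so the values are deterministic. `backoff_schedule` is a hypothetical helper written for this illustration only.

```python
from typing import List, Optional


def backoff_schedule(max_attempts: int = 5, cap: float = 60.0,
                     retry_after: Optional[float] = None) -> List[float]:
    """Waits (seconds) between attempts, jitter omitted so the values are deterministic."""
    waits = []
    # There is one wait between each pair of consecutive attempts.
    for attempt in range(max_attempts - 1):
        wait = min(cap, 2.0 ** attempt)
        if retry_after:
            wait = max(wait, retry_after)  # never retry before the server says it is safe
        waits.append(wait)
    return waits


if __name__ == "__main__":
    print(backoff_schedule())                # defaults: five attempts, no Retry-After
    print(backoff_schedule(retry_after=18))  # Retry-After dominates the early waits
```

With the defaults the waits grow as 1, 2, 4, 8 seconds. A Retry-After of 18 lifts every wait that would otherwise be shorter than 18 seconds.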

The official OpenAI Python SDK retries 429 responses automatically (twice by default, configurable with max_retries), and it works as-is against siati.

Per-organisation limits

For enterprise customers we can set an aggregate org-wide RPM and split it between keys. Talk to us if you need that at your scale.

Concurrent requests

Concurrency is limited independently of RPM. Each backend has a max_num_seqs (the vLLM concurrency limit), and the BackendRouter picks a backend with spare slots. If every backend serving your model is saturated, the request queues, with priority based on tier.

The queue depth is exposed in Infrastructure (fleet) for transparency.
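
The routing and queueing behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the BackendRouter's actual code: `RouterSketch`, `Backend`, and `TIER_PRIORITY` are hypothetical names, choosing the backend with the most spare slots is one possible policy, and the assumption that higher tiers are dequeued first is not confirmed by this page.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed ordering: a lower number is served first. The real scheme may differ.
TIER_PRIORITY = {"ludicrous": 0, "fast": 1, "medium": 2, "slow": 3}


@dataclass
class Backend:
    name: str
    max_num_seqs: int  # concurrency limit of this backend
    running: int = 0   # sequences currently in flight

    @property
    def spare(self) -> int:
        return self.max_num_seqs - self.running


class RouterSketch:
    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends
        self._queue: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a tier

    def submit(self, request_id: str, tier: str) -> Optional[str]:
        """Return the chosen backend's name, or None if the request was queued."""
        best = max(self.backends, key=lambda b: b.spare)
        if best.spare > 0:
            best.running += 1
            return best.name
        heapq.heappush(self._queue, (TIER_PRIORITY[tier], next(self._seq), request_id))
        return None

    def release(self, backend_name: str) -> Optional[str]:
        """Free one slot; hand it to the highest-priority queued request, if any."""
        backend = next(b for b in self.backends if b.name == backend_name)
        backend.running -= 1
        if not self._queue:
            return None
        _, _, request_id = heapq.heappop(self._queue)
        backend.running += 1
        return request_id

    @property
    def queue_depth(self) -> int:
        return len(self._queue)
```

Note that none of this touches the RPM counter: a request can be well within its rate limit and still wait in the queue when every backend is full.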