API reference

Embeddings & Rerank

POST /v1/embeddings (BGE-M3, 1024-dim) and POST /v1/rerank (BGE-reranker-v2-m3) — OpenAI-compatible, self-hosted in Switzerland.

Last updated: 2026-06-27

Embeddings

POST https://api.siati.ai/v1/embeddings

Convert text into 1024-dimensional vectors using BGE-M3 (multilingual). Use them for semantic search, clustering, deduplication, or as the indexing side of a RAG pipeline.

Request

curl https://api.siati.ai/v1/embeddings \
  -H "Authorization: Bearer $SIATI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3",
    "input": [
      "siati.ai è un provider svizzero di AI sovrana.",
      "Apertus è il modello LLM della Swiss AI Initiative."
    ]
  }'

Parameters

Param	Type	Required	Description
`input`	string \| array<string>	✓	A single string or an array of strings. See limits below.
`model`	string	–	`bge-m3` (default). It's the only embedding model.
`encoding_format`	string	–	`float` (default) or `base64` (float32, little-endian).
`dimensions`	int	–	Accepted for OpenAI compat but ignored — BGE-M3 is fixed at 1024.
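
With `encoding_format` set to `base64`, each `embedding` field presumably arrives as a base64 string instead of a list of floats. That is an assumption based on the OpenAI convention; the table above only states float32, little-endian. A minimal sketch of decoding such a string with the standard library follows. `decode_embedding` is a hypothetical helper name, and the 4-value vector is made up for the round-trip check.

```python
import base64
import struct
from typing import List


def decode_embedding(b64: str) -> List[float]:
    """Decode a base64 string of little-endian float32 values into floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" forces little-endian, "f" is a 4-byte IEEE 754 float
    return list(struct.unpack(f"<{count}f", raw))


# Round-trip check on a made-up vector (a real response would carry 1024 values)
sample = [0.5, -0.25, 0.125, 1.0]
encoded = base64.b64encode(struct.pack("<4f", *sample)).decode("ascii")
decoded = decode_embedding(encoded)
print(decoded)
```

If NumPy is available, `np.frombuffer(raw, dtype="<f4")` performs the same decoding without copying the bytes.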

Limits

Limit	Value	On exceed
Inputs per request (batch)	32	`400 invalid_request_error` — `too many inputs`
Tokens per single text	8192	text is truncated by the model
Vector dimension	1024 (fixed)	—
Rate limit	120 req/min per endpoint, plus your API-key tier limit	`429 rate_limit_exceeded` (with `Retry-After`)

For tiers and per-key limits see /api/rate-limits. Need a larger batch? Just send multiple requests — there's no daily cap on embeddings beyond your wallet balance.
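
The batch limit and the 429 behaviour described above can be handled together on the client side. The sketch below splits the inputs into slices of 32 and waits out rate-limit responses. `embed_batch` and `RateLimited` are hypothetical names: you would implement `embed_batch` around your HTTP client or SDK call and raise `RateLimited` when the API answers 429. The sketch also assumes `Retry-After` carries a number of seconds, which this page does not confirm.

```python
import time
from typing import Callable, Iterator, List, Sequence

MAX_BATCH = 32  # inputs per request, from the limits table


def batches(items: Sequence[str], size: int = MAX_BATCH) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class RateLimited(Exception):
    """Hypothetical error carrying the Retry-After value (seconds) of a 429."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[float]]:
    """Embed any number of texts, 32 at a time, waiting out 429 responses."""
    vectors: List[List[float]] = []
    for batch in batches(texts):
        for attempt in range(1, max_attempts + 1):
            try:
                vectors.extend(embed_batch(batch))
                break
            except RateLimited as err:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
                sleep(err.retry_after)
    return vectors


# Demo with a fake backend that rejects the first call, then succeeds
calls = {"n": 0}


def fake_embed_batch(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited(retry_after=0.0)
    return [[float(len(text))] for text in batch]


result = embed_all([f"text {i}" for i in range(70)], fake_embed_batch, sleep=lambda s: None)
print(len(result), calls["n"])
```

The output order matches the input order because batches are processed sequentially, so `result[i]` belongs to `texts[i]`.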

Response

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0123, -0.0456, "..."] },
    { "object": "embedding", "index": 1, "embedding": [0.0234, -0.0567, "..."] }
  ],
  "model": "bge-m3",
  "usage": { "prompt_tokens": 24, "total_tokens": 24 }
}

Each embedding is an array of 1024 floats, L2-normalised, so the dot product of two vectors equals their cosine similarity. `usage` token counts are estimates (the embedding backend doesn't return an exact count).
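
Why normalisation matters for similarity can be checked in a few lines: for unit-length vectors the denominator of the cosine formula is 1, so a plain dot product gives the same number. The sketch below uses made-up 3-dimensional vectors in place of real 1024-dimensional embeddings, and all helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalise(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Textbook cosine similarity, valid for non-zero vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for embeddings returned by the API
a = l2_normalise([3.0, 4.0, 0.0])
b = l2_normalise([4.0, 3.0, 0.0])

print(dot(a, b), cosine(a, b))
```

Both values agree up to floating-point rounding, which is why a vector store configured for inner product and one configured for cosine return the same ranking on normalised vectors.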

SDK example

import os

import numpy as np
from openai import OpenAI

client = OpenAI(
    base_url="https://api.siati.ai/v1",
    api_key=os.environ["SIATI_API_KEY"],
)

texts = ["Apertus is a Swiss LLM.", "Llama is a US model.", "Pasta is Italian."]
embs = client.embeddings.create(model="bge-m3", input=texts).data
vectors = np.array([e.embedding for e in embs])

# Cosine similarity matrix (vectors are already L2-normalised)
sim = vectors @ vectors.T
print(sim)
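
The similarity matrix above compares every text with every other text. For search you usually compare one query vector against many document vectors and keep the best few. A minimal sketch of that step, assuming both sides were embedded with `bge-m3` and are unit length; `top_k` is a hypothetical helper and the 2-dimensional vectors are made up:

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scores = doc_vecs @ query_vec      # dot product equals cosine for unit vectors
    order = np.argsort(-scores)[:k]    # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Made-up unit vectors standing in for 1024-dimensional embeddings
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
query = np.array([0.8, 0.6])

hits = top_k(query, docs, k=2)
print(hits)
```

The returned indices point into the document array, so they can be used directly to look up the original texts or to build the candidate list for a rerank call.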

Performance

Setup	Latency (1 text, ~100 tokens)	Throughput (batch 32)
BGE-M3, sovereign infra (CH)	~30 ms	~800 embeds/s

Embeddings are billed on input tokens only; there is no "completion" to pay for. Cost: see Pricing.
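
When planning a bulk indexing job, it helps to check which figure is the bottleneck. With full batches, the per-endpoint rate limit allows 120 × 32 = 3,840 embeddings per minute (64 per second), well below the ~800 embeds/s in the table. The sketch below is back-of-the-envelope arithmetic using only the numbers on this page. It ignores network latency and assumes your API-key tier limit is not lower than the per-endpoint limit.

```python
import math

MAX_BATCH = 32              # inputs per request (limits table)
REQS_PER_MIN = 120          # per-endpoint rate limit (limits table)
BACKEND_EMBEDS_PER_S = 800  # approximate throughput (performance table)


def estimate_minutes(num_texts: int) -> float:
    """Lower-bound minutes to embed num_texts, taking whichever limit binds."""
    requests_needed = math.ceil(num_texts / MAX_BATCH)
    rate_limited = requests_needed / REQS_PER_MIN
    backend_bound = num_texts / BACKEND_EMBEDS_PER_S / 60
    return max(rate_limited, backend_bound)


print(round(estimate_minutes(100_000), 1))  # prints 26.0
```

For 100,000 texts the rate limit dominates: 3,125 requests at 120 per minute take about 26 minutes, while the backend alone would need roughly 2 minutes.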

Tips

Cosine, not Euclidean: cosine distance is what BGE-M3 is trained for.
Chunk size matters: 256–512 tokens is the sweet spot for RAG. Longer chunks hurt retrieval precision.
Normalise before storing only if your index also holds vectors from other sources; BGE-M3 output is already L2-normalised.
Don't mix model spaces: never compare BGE-M3 vectors with OpenAI text-embedding-3 vectors.
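
The chunk-size tip can be applied with a simple sliding window. The sketch below counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: the model's tokenizer usually produces more tokens than words, so the 300-word default is only a guess at landing inside the 256–512 token range. The overlap between chunks is also an addition of this sketch, not something this page prescribes. `chunk_words` is a hypothetical helper.

```python
from typing import List


def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> List[str]:
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Made-up 700-word document
document = " ".join(f"w{i}" for i in range(700))
chunks = chunk_words(document)
print([len(c.split()) for c in chunks])  # prints [300, 300, 160]
```

For precise sizing, replace the word count with a count from the model's own tokenizer; the windowing logic stays the same.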

Rerank

POST https://api.siati.ai/v1/rerank

Re-order a list of documents by relevance to a query using BGE-reranker-v2-m3, a cross-encoder. It is far more precise than cosine similarity because it reads the query and each document together, at the cost of one forward pass per document.

The request/response shape follows the Cohere/Jina convention (there is no OpenAI standard for rerank).

Request

curl https://api.siati.ai/v1/rerank \
  -H "Authorization: Bearer $SIATI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "Qual è la capitale della Svizzera?",
    "documents": [
      "Roma è la capitale italiana.",
      "Berna è la capitale della Confederazione Svizzera.",
      "Il Cervino è una montagna delle Alpi."
    ],
    "top_n": 2,
    "return_documents": true
  }'

Parameters

Param	Type	Required	Description
`query`	string	✓	The search query.
`documents`	array<string>	✓	Candidate documents to rank. Max 32 per request.
`top_n`	int	–	Return only the top N results. Default: all, sorted by score.
`return_documents`	bool	–	If `true`, echo each document's text in the result. Default `false`.
`model`	string	–	`bge-reranker-v2-m3` (default).

Response

{
  "object": "list",
  "model": "bge-reranker-v2-m3",
  "results": [
    { "index": 1, "relevance_score": 0.9988, "document": { "text": "Berna è la capitale della Confederazione Svizzera." } },
    { "index": 0, "relevance_score": 0.0020, "document": { "text": "Roma è la capitale italiana." } }
  ],
  "usage": { "total_tokens": 57 }
}

`results` are sorted by `relevance_score` in descending order; scores are sigmoid-normalised to the range 0 to 1. `index` refers to the position in your original `documents` array. `document` is present only when `return_documents` is `true`.
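
Because `index` points back into the request, the response can be joined with your own data even when `return_documents` is left at `false`. The sketch below reuses the documents and scores from the sample above and drops weak matches. The 0.5 cut-off and the `relevant_docs` helper are hypothetical; a sensible threshold depends on your data.

```python
documents = [
    "Roma è la capitale italiana.",
    "Berna è la capitale della Confederazione Svizzera.",
    "Il Cervino è una montagna delle Alpi.",
]

# Results copied from the sample response (without the optional `document` field)
response = {
    "results": [
        {"index": 1, "relevance_score": 0.9988},
        {"index": 0, "relevance_score": 0.0020},
    ]
}

MIN_SCORE = 0.5  # hypothetical cut-off; tune it on your own data


def relevant_docs(response: dict, documents: list, min_score: float = MIN_SCORE) -> list:
    """Return (document, score) pairs whose score clears min_score, best first."""
    return [
        (documents[hit["index"]], hit["relevance_score"])
        for hit in response["results"]
        if hit["relevance_score"] >= min_score
    ]


kept = relevant_docs(response, documents)
print(kept)
```

Only the Bern sentence survives the filter; the result order is inherited from the API, which already sorts by score.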

SDK example

The OpenAI SDK has no rerank method — call it directly:

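import os

import requests

# Hypothetical candidates, e.g. the top hits from your vector search
candidates = [
    {"id": "a", "text": "A 16-inch laptop with a 24 GB GPU, aimed at ML work."},
    {"id": "b", "text": "A lightweight tablet for note-taking."},
]

resp = requests.post(
    "https://api.siati.ai/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['SIATI_API_KEY']}"},
    json={
        "query": "best laptop for ML",
        "documents": [d["text"] for d in candidates],
        "top_n": 5,
    },
    timeout=30,
)
resp.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body

# Map each hit back to its candidate via the original index
top = [candidates[hit["index"]] for hit in resp.json()["results"]]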

Limits

Same envelope as embeddings: max 32 documents per request, 8192 tokens per query+document pair, 120 req/min per endpoint. Each document is scored against the query (cross-encoder), so latency grows with the document count — keep candidate lists tight (≤ 32).
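
If a candidate list cannot be trimmed to 32, one option is to score it in several requests and merge the results. This is a sketch under an assumption: since each document is scored against the query on its own, scores from different requests should be comparable, but this page does not state that explicitly. `rerank_batch` is a hypothetical callable that wraps one `/v1/rerank` call without `top_n` and returns its `results` list.

```python
from typing import Callable, Dict, List

MAX_DOCS = 32  # documents per rerank request, from the limits above


def rerank_all(
    query: str,
    documents: List[str],
    rerank_batch: Callable[[str, List[str]], List[Dict]],
    top_n: int = 5,
) -> List[Dict]:
    """Score documents in batches of 32 and return the global top_n hits."""
    merged = []
    for offset in range(0, len(documents), MAX_DOCS):
        batch = documents[offset:offset + MAX_DOCS]
        for hit in rerank_batch(query, batch):
            # Shift the batch-local index back to the position in `documents`
            merged.append({
                "index": hit["index"] + offset,
                "relevance_score": hit["relevance_score"],
            })
    merged.sort(key=lambda h: h["relevance_score"], reverse=True)
    return merged[:top_n]


# Demo with a fake scorer: the score grows with the number in the document text
def fake_rerank_batch(query, batch):
    return [{"index": i, "relevance_score": int(doc) / 100} for i, doc in enumerate(batch)]


docs = [str(n) for n in range(40)]
top = rerank_all("any query", docs, fake_rerank_batch, top_n=3)
print([hit["index"] for hit in top])  # prints [39, 38, 37]
```

Each extra batch costs another request against the 120 req/min limit, so tightening the vector-search stage is usually the cheaper fix.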
