Retrieval-Augmented Generation — chatting with your own documents, with citations.

Last updated: 2026-05-19

Retrieval-Augmented Generation (RAG)

A plain LLM only knows what was in its training data. RAG injects fresh, private knowledge at query time:

  1. Your documents are split into chunks and embedded into vectors.
  2. At question time, the user's query is embedded too.
  3. The most similar chunks are fetched and pasted into the system prompt.
  4. The LLM answers from those chunks and cites them.

End result: the model knows your contracts, your wiki, your past tickets — without you having to fine-tune it.
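Steps 2 and 3 above can be illustrated with a toy example. This is a minimal sketch, not the production code: the three-dimensional vectors are made up (real BGE-M3 embeddings have 1024 dimensions), and `cosine` and `top_k` are hypothetical helper names.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

# Made-up vectors for illustration only.
chunks = [
    {"text": "Notice period is 30 days.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Invoices are due in 60 days.", "vector": [0.1, 0.9, 0.0]},
    {"text": "Termination requires written notice.", "vector": [0.8, 0.2, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedded user question

context = top_k(query_vec, chunks, k=2)
print(context)
```

The two chunks about notice and termination rank above the invoice chunk because their vectors point in nearly the same direction as the query vector.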

What we provide

A two-stage hybrid retrieval pipeline built end-to-end on open components, all hosted in Switzerland:

| Layer | Component | Role |
|---|---|---|
| Document parsing | Docling (IBM, MIT) | Layout-aware extraction: preserves tables as markdown, multi-column reading order, OCR for scanned pages. Falls back to pdftotext if Docling is down. |
| Chunking | Custom (PHP) | 512-token chunks with 64-token overlap, respecting sentence boundaries. |
| Dense embeddings | BGE-M3 (BAAI, Apache 2.0) | Multilingual, 1024-dim, served on dedicated L40S GPU via TEI. |
| Sparse vectors | BM25 (in-house tokenizer) | IT/EN/DE/FR/ES stopword-aware, computed per chunk and per query. Captures exact-term matches (codes, names, acronyms). |
| Vector store | Qdrant 1.18 | Persistent dense + sparse storage. Cosine for dense, IDF for sparse. |
| Hybrid query | Reciprocal Rank Fusion (RRF) in Qdrant | Combines dense + sparse rankings into one ordering. |
| Reranker | BGE-reranker-v2-m3 (BAAI, Apache 2.0) | Cross-encoder, scores (query, chunk) jointly. Reorders top-30 → top-K (+15-25% recall@5). |
| LLM | Apertus 70B by default, any model from the catalog | Generates the final answer grounded in the retrieved chunks. |
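The chunking rule (fixed token budget, fixed overlap, whole sentences only) can be sketched as below. The production chunker is written in PHP and is not shown in this page, so this is an illustrative Python approximation under stated assumptions: tokens are counted as whitespace-separated words rather than with the model tokenizer, sentences are split with a simple regex, and `chunk_text` is a hypothetical name.

```python
import re

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens,
    carrying up to `overlap` trailing tokens into the next chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    size = 0                 # token count of `current`
    for sentence in sentences:
        n = len(sentence.split())  # assumption: 1 token = 1 whitespace-separated word
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            kept: list[str] = []
            kept_size = 0
            for prev in reversed(current):
                prev_n = len(prev.split())
                if kept_size + prev_n > overlap:
                    break
                kept.insert(0, prev)
                kept_size += prev_n
            # Drop the overlap if it would push the new chunk over budget.
            if kept_size + n > max_tokens:
                kept, kept_size = [], 0
            current, size = kept, kept_size
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Small budget so the behaviour is visible: 4-token sentences, 10-token chunks.
demo = " ".join(f"Clause {i} applies here." for i in range(1, 7))
demo_chunks = chunk_text(demo, max_tokens=10, overlap=4)
for c in demo_chunks:
    print(c)
```

In this sketch a single sentence longer than `max_tokens` still becomes its own oversized chunk; a real chunker would need a fallback split for that case.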

The retrieval pipeline

```text
Ingestion:                                 Query:
─────────                                  ──────
[PDF/DOCX/MD]                              [user question]
      ↓                                          ↓
 Docling parse                             BGE-M3 dense
      ↓                                    BM25 sparse
 Chunker (512/64)                                ↓
      ↓                                    Qdrant hybrid + RRF (top-30)
 BGE-M3 dense                                    ↓
 BM25 sparse                               BGE-reranker-v2-m3 (top-5)
      ↓                                          ↓
 Qdrant upsert                             Apertus 70B (or chosen model)
                                                 ↓
                                           Answer + cited sources
```
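The fusion step in the query path can be illustrated in a few lines. Reciprocal Rank Fusion gives each chunk a score of 1 / (k + rank) in every ranking it appears in and sums those scores. The sketch below is a standalone illustration, not Qdrant's internal code: the constant k = 60 is the value commonly used in the literature (the value Qdrant applies is not stated in this page), and the chunk IDs are made up.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists into one ordering with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["c3", "c1", "c7"]   # hypothetical dense ranking, best first
sparse = ["c1", "c9", "c3"]  # hypothetical sparse (BM25) ranking, best first

fused = rrf([dense, sparse])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that appear in both rankings (`c1`, `c3`) rise to the top, which is the point of fusing: agreement between dense and sparse retrieval counts for more than a high position in only one of them.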

When RAG is the right tool

✅ Question-answering on a fixed corpus (contracts, manuals, policies, wikis).
✅ Internal search with semantic understanding (find what you mean, not just keywords).
✅ Compliance: the model can only quote what you gave it; less freedom to hallucinate.

❌ Real-time data feeds (use tool calling instead).
❌ Modifying the model's "personality" or output style (fine-tuning is the answer).
❌ Tiny corpus (< 10 docs): the LLM context window probably fits everything already.

How to use it

Two options:

Through the dashboard (no code)

At my.siati.ai/dashboard/rag: create a Knowledge Base, drag and drop PDF, DOCX, MD, or TXT files, and ask questions. Each answer cites its source documents with anchors.

Through the API

See API: RAG for the four endpoints (create KB, upload doc, list docs, chat).

```bash
# Create a knowledge base
curl https://my.siati.ai/api/v1/rag/kb \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"name": "Contracts 2026"}'

# Upload a PDF (returns 202, indexes async)
# contracts-2026-abc123 is an example knowledge base ID; use your own
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/docs \
  -H "Authorization: Bearer $SIATI_JWT" \
  -F file=@contract.pdf

# Chat
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/chat \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"question": "Termine di preavviso?", "model": "apertus-70b-instruct"}'
```
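For callers who prefer Python, the chat call can be built with the standard library. This is a sketch that mirrors the curl example: the URL, headers, and JSON fields are copied from it, while `build_chat_request` is a hypothetical helper name. The response schema is not described in this page, so the sketch only builds the request and does not parse a reply.

```python
import json
import os
import urllib.request

BASE_URL = "https://my.siati.ai/api/v1/rag"

def build_chat_request(kb_id: str, question: str, model: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for the chat endpoint."""
    body = json.dumps({"question": question, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/kb/{kb_id}/chat",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request(
    kb_id="contracts-2026-abc123",  # example ID from the curl calls
    question="Termine di preavviso?",
    model="apertus-70b-instruct",
    token=os.environ.get("SIATI_JWT", ""),
)
print(req.get_method(), req.full_url)
```

Sending it is a matter of passing `req` to `urllib.request.urlopen`; see API: RAG for the shape of the response.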

Defaults that work

We picked sensible defaults so RAG works out of the box:

  • Chunk size: 512 tokens with 64 overlap. Good for most prose.
  • Top-K: 30 candidates are retrieved, reranked, and the best 5 chunks are passed to the LLM.
  • Scoring: cosine similarity for dense vectors, IDF-weighted scoring for sparse vectors, RRF to fuse the two rankings.
  • System prompt template: the model is instructed to answer only from the retrieved context and to cite sources, which limits hallucinations from training data.
  • PDF max size: 50 MB. Indexing of a 500-page PDF: about 60–90 seconds end-to-end.
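To make the top-K and system-prompt defaults concrete, here is a sketch of how reranked chunks might be turned into a citable context block. The actual template is not published in this page, so the wording of the instructions, the `[n]` citation format, the chunk fields (`source`, `text`), and the name `build_system_prompt` are all assumptions for illustration.

```python
TOP_K = 5  # chunks passed to the LLM, per the defaults above

def build_system_prompt(reranked: list[dict], top_k: int = TOP_K) -> str:
    """Keep the top_k reranked chunks and format them as numbered, citable context."""
    context = "\n\n".join(
        f"[{i}] ({chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(reranked[:top_k], start=1)
    )
    # Hypothetical instruction wording; the production template may differ.
    return (
        "Answer using only the context below. "
        "Cite sources as [n]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}"
    )

# 30 made-up candidates, already sorted by reranker score (best first).
candidates = [
    {"source": f"contract.pdf#p{i}", "text": f"Clause {i}."} for i in range(1, 31)
]
prompt = build_system_prompt(candidates)
print(prompt)
```

Numbering the chunks gives the model a compact handle to cite, and the numbers can be mapped back to document anchors when the answer is rendered.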

Why hybrid + reranker, not just dense

Pure dense retrieval (only BGE-M3) misses two common scenarios:

  1. Exact-term queries: a user looks for "Articolo 1571 c.c." or "SKU 47B-A12". The dense embedding tends to dilute these into a general "legal article" or "product code" concept. BM25 sparse vectors fix this because exact tokens get high weight.
  2. Semantically close but wrong results: the top-5 by cosine similarity sometimes contains chunks that are on topic but do not answer the question. A cross-encoder reads each candidate jointly with the query and re-ranks more accurately, a measurable +15–25% recall@5 on standard RAG benchmarks.
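The first point can be seen in a small BM25 computation. The sketch below uses the textbook BM25 formula with the usual parameters (k1 = 1.2, b = 0.75) and a whitespace tokenizer. It is not the in-house tokenizer or Qdrant's sparse scoring, and the three documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with classic BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "articolo 1571 definisce la locazione".split(),
    "articolo 1572 riguarda la durata".split(),
    "il contratto di locazione ha una durata".split(),
]
scores = bm25_scores("articolo 1571".split(), docs)
print([round(s, 3) for s in scores])
```

The rare token `1571` occurs in one document only, so its IDF is higher than that of `articolo`, and the first document scores clearly above the second even though both are about a legal article.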

We do both, at the cost of ~150 ms extra per query (reranking 30 chunks on an L40S is fast) and a +30% storage overhead on the vector store (sparse vectors are small compared to dense ones).
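A rough estimate shows what the storage figure means in practice. The sketch assumes dense vectors are stored as unquantized float32 and that the +30% applies to the dense vector size alone. Both are assumptions, and index structures and payloads are ignored.

```python
DIMS = 1024             # BGE-M3 dense dimensionality
BYTES_PER_FLOAT = 4     # assumption: float32 storage, no quantization
SPARSE_OVERHEAD = 0.30  # the +30% figure, assumed relative to dense size

def vector_storage_bytes(num_chunks: int) -> dict[str, float]:
    """Estimate raw vector storage for a given number of chunks."""
    dense = num_chunks * DIMS * BYTES_PER_FLOAT
    sparse = dense * SPARSE_OVERHEAD
    return {"dense": dense, "sparse": sparse, "total": dense + sparse}

estimate = vector_storage_bytes(100_000)
print({name: f"{size / 1e6:.1f} MB" for name, size in estimate.items()})
```

Under these assumptions, 100,000 chunks need about 410 MB for dense vectors and about 123 MB more for sparse ones.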

Compared to OpenAI Assistants

| | siati RAG | OpenAI Assistants |
|---|---|---|
| Data location | Switzerland | US (Azure) |
| Vector store | Qdrant (we run it) | OpenAI-managed |
| Embeddings | BGE-M3 open-weight | OpenAI proprietary |
| Document limit | Tied to your storage plan | 20 files / Assistant |
| Audit trail | Full (we give it to you) | Limited |
| File deletion | Hard delete from our store | Soft delete |
| Cost model | Token-based + storage | Subscription + tokens |