Retrieval-Augmented Generation — chatting with your own documents, with citations.

Last updated: 2026-05-19

Retrieval-Augmented Generation (RAG)

A plain LLM only knows what was in its training data. RAG injects fresh, private knowledge at query time:

1. Your documents are split into chunks and embedded into vectors.
2. At question time, the user's query is embedded too.
3. The most similar chunks are fetched and pasted into the system prompt.
4. The LLM answers grounded in those chunks, citing them.
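
The four steps above can be illustrated with a toy, dependency-free sketch. It assumes a bag-of-words count vector as a stand-in for a real embedding model, and the helper names (`embed`, `retrieve`, `build_system_prompt`) and the prompt wording are hypothetical, not the platform's actual code.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[tuple[int, str]]:
    """Return the k chunks most similar to the question, with their indices."""
    q = embed(question)
    ranked = sorted(enumerate(chunks), key=lambda p: cosine(q, embed(p[1])), reverse=True)
    return ranked[:k]


def build_system_prompt(hits: list[tuple[int, str]]) -> str:
    """Paste the retrieved chunks into a prompt, numbered so they can be cited."""
    context = "\n".join(f"[{i + 1}] {text}" for i, text in hits)
    return "Answer only from the context below and cite sources as [n].\n\n" + context


chunks = [
    "The notice period for termination is 90 days.",
    "Invoices are payable within 30 days of receipt.",
    "The office is closed on public holidays.",
]
hits = retrieve("What is the notice period for termination?", chunks, k=2)
prompt = build_system_prompt(hits)
print(prompt)
```

A real deployment replaces `embed` with a learned model, which is what lets retrieval match on meaning rather than shared words.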

End result: the model knows your contracts, your wiki, your past tickets — without you having to fine-tune it.

What we provide

A two-stage hybrid retrieval pipeline built end-to-end on open components, all hosted in Switzerland:

Layer	Component	Role
Document parsing	Docling (IBM, MIT)	Layout-aware extraction: preserves tables as markdown, multi-column reading order, OCR for scanned pages. Falls back to `pdftotext` if Docling is down.
Chunking	Custom (PHP)	512-token chunks with 64-token overlap, respecting sentence boundaries.
Dense embeddings	BGE-M3 (BAAI, Apache 2.0)	Multilingual, 1024-dim, served on dedicated L40S GPU via TEI.
Sparse vectors	BM25 (in-house tokenizer)	IT/EN/DE/FR/ES stopword-aware, computed per chunk and per query. Captures exact-term matches (codes, names, acronyms).
Vector store	Qdrant 1.18	Persistent dense + sparse storage. Cosine similarity for dense vectors, IDF-weighted scoring for sparse vectors.
Hybrid query	Reciprocal Rank Fusion (RRF) in Qdrant	Combines dense + sparse rankings into one ordering.
Reranker	BGE-reranker-v2-m3 (BAAI, Apache 2.0)	Cross-encoder, scores `(query, chunk)` jointly. Reorders top-30 → top-K (+15-25% recall@5).
LLM	Apertus 70B by default, any model from the catalog	Generates the final answer grounded in the retrieved chunks.
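
To make the chunking row concrete, here is a minimal Python sketch of sentence-aware chunking with overlap. It is not the platform's PHP chunker: it assumes whitespace-separated words as a stand-in for model tokens and a simple punctuation-based sentence splitter, and the function name `chunk_text` is hypothetical.

```python
import re


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens whitespace tokens.

    Each new chunk starts with the trailing sentences of the previous one,
    as long as they fit within the overlap budget. A single sentence longer
    than max_tokens is kept whole, so such a chunk can exceed the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    size = 0                 # token count of `current`
    for sentence in sentences:
        n = len(sentence.split())
        if current and size + n > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences that fit the overlap.
            tail: list[str] = []
            tail_size = 0
            for prev in reversed(current):
                m = len(prev.split())
                if tail_size + m > overlap:
                    break
                tail.insert(0, prev)
                tail_size += m
            current, size = tail, tail_size
        current.append(sentence)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_text(demo, max_tokens=6, overlap=3):
    print(chunk)
```

Carrying whole sentences into the next chunk keeps a fact that straddles a chunk boundary retrievable from either side.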

The retrieval pipeline

Ingestion:                                 Query:
─────────                                  ──────
[PDF/DOCX/MD]                              [user question]
      ↓                                          ↓
 Docling parse                             BGE-M3 dense
      ↓                                    BM25 sparse
 Chunker (512/64)                                ↓
      ↓                                    Qdrant hybrid + RRF (top-30)
 BGE-M3 dense                                    ↓
 BM25 sparse                               BGE-reranker-v2-m3 (top-5)
      ↓                                          ↓
 Qdrant upsert                             Apertus 70B (or chosen model)
                                                 ↓
                                           Answer + cited sources
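
The fusion step in the query path can be sketched in a few lines of Python. This shows the general Reciprocal Rank Fusion formula, where each list contributes `1 / (k + rank)` per item. The constant `k = 60` is the value commonly used in the literature and is an assumption here; the constant Qdrant applies is not stated on this page. The chunk ids are made up.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 30) -> list[tuple[str, float]]:
    """Fuse several ranked id lists into one ordering with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # An item earns more the higher it ranks, in every list it appears in.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


dense_ranking = ["c7", "c2", "c9"]   # hypothetical ids, best first
sparse_ranking = ["c2", "c4", "c7"]
fused = rrf_fuse([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
```

RRF uses only ranks, never raw scores, so cosine similarities and BM25 weights do not need to be put on a common scale before they are combined.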

When RAG is the right tool

✅ Question-answering on a fixed corpus (contracts, manuals, policies, wikis).
✅ Internal search with semantic understanding (find what you mean, not just keywords).
✅ Compliance: the model can only quote what you gave it; less freedom to hallucinate.

❌ Real-time data feeds (use tool calling instead).
❌ Modifying the model's "personality" or output style (fine-tuning is the answer).
❌ Tiny corpus (< 10 docs): the LLM context window probably fits everything already.

How to use it

Two options:

Through the dashboard (no code)

my.siati.ai/dashboard/rag: create a Knowledge Base, drag-and-drop PDFs/DOCX/MD/TXT, ask questions. The answer cites the source documents with anchors.

Through the API

See API: RAG for the four endpoints (create KB, upload doc, list docs, chat).

# Create a knowledge base
curl https://my.siati.ai/api/v1/rag/kb \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"name": "Contracts 2026"}'

# Upload a PDF (returns 202, indexes async)
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/docs \
  -H "Authorization: Bearer $SIATI_JWT" \
  -F file=@contract.pdf

# Chat
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/chat \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"question": "Termine di preavviso?", "model": "apertus-70b-instruct"}'
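
For callers who prefer Python over curl, the two JSON calls above could be built as follows with the standard library. This is a sketch: the helper `json_post` is hypothetical, the requests are constructed but not sent, and the response schema is not documented on this page, so nothing is assumed about it.

```python
import json
import os
import urllib.request

BASE_URL = "https://my.siati.ai/api/v1/rag"


def json_post(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


token = os.environ.get("SIATI_JWT", "")

# Same payloads as the curl examples above.
create_kb = json_post("/kb", {"name": "Contracts 2026"}, token)
chat = json_post(
    "/kb/contracts-2026-abc123/chat",
    {"question": "Termine di preavviso?", "model": "apertus-70b-instruct"},
    token,
)

# To actually send a request, pass it to urllib.request.urlopen(...).
print(create_kb.get_method(), create_kb.full_url)
print(chat.get_method(), chat.full_url)
```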

Defaults that work

We picked sensible defaults so RAG works out of the box:

Chunk size: 512 tokens with 64 overlap. Good for most prose.
Top-K: 30 candidates are retrieved and reranked; the best 5 chunks are passed to the LLM.
Scoring: cosine similarity for dense vectors, IDF-weighted scoring for sparse vectors, RRF to fuse the two rankings.
System prompt template: the model is instructed to answer only from the retrieved context and to cite its sources, which limits answers drawn from training data.
PDF max size: 50 MB. Indexing of a 500-page PDF: about 60–90 seconds end-to-end.
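
The top-K default (retrieve 30, keep 5) amounts to a second, more careful scoring pass over a short candidate list. The sketch below shows only that shape. The scorer `overlap_score` is a hypothetical word-overlap stand-in, not a cross-encoder such as BGE-reranker-v2-m3, and the candidate texts are invented.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_k: int = 5,
) -> list[str]:
    """Rescore every candidate against the query and keep the best top_k."""
    ordered = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ordered[:top_k]


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query words that also occur in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "payment terms are thirty days",
    "the notice period is ninety days",
    "notice boards are in the lobby",
]
top = rerank("notice period", candidates, overlap_score, top_k=2)
print(top)
```

Because the scorer sees the query and the chunk together, it can be much more expensive than vector search, which is why it runs on tens of candidates rather than on the whole corpus.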

Why hybrid + reranker, not just dense

Pure dense retrieval (only BGE-M3) misses two common scenarios:

Exact-term queries: a user looks for "Articolo 1571 c.c." or "SKU 47B-A12". The dense embedding tends to dilute these into a general "legal article" or "product code" concept. BM25 sparse fixes this — exact tokens get high weight.
Near-miss matches: the top 5 by cosine similarity are sometimes "semantically close but wrong". The reranker fixes this: as a cross-encoder, it reads each candidate jointly with the query and reorders the list more accurately, a measurable +15–25% recall@5 on standard RAG benchmarks.
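
The exact-term case can be illustrated with a small BM25 scorer. This is a sketch of the textbook BM25 formula, not the in-house tokenizer or the sparse vectors stored in Qdrant: the tokenizer regex, the parameters `k1 = 1.2` and `b = 0.75`, and the sample documents are all assumptions, and no stopword filtering is applied.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split, keeping hyphenated codes such as '47b-a12' whole."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Spare part SKU 47B-A12 ships within two days.",
    "Spare part SKU 47B-A13 is out of stock.",
    "Product codes are listed in the catalog.",
]
scores = bm25_scores("SKU 47B-A12", docs)
print(scores)
```

The code `47b-a12` occurs in a single document, so its IDF is high and that document ranks first, while the near-identical `47B-A13` entry only scores on the shared word `sku`.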

We do both, at the cost of roughly 150 ms extra per query (reranking 30 chunks on an L40S is fast) and about 30% more storage in the vector store (sparse vectors are small compared to dense ones).

Compared to OpenAI Assistants

	siati RAG	OpenAI Assistants
Data location	Switzerland	US (Azure)
Vector store	Qdrant (we run it)	OpenAI-managed
Embeddings	BGE-M3 open-weight	OpenAI proprietary
Document limit	Tied to your storage plan	20 files / Assistant (v1 API); up to 10,000 files per vector store (v2 API)
Audit trail	Full (we give it to you)	Limited
File deletion	Hard delete from our store	Soft delete
Cost model	Token-based + storage	Subscription + tokens
