Retrieval-Augmented Generation — chatting with your own documents, with citations.
Last updated: 2026-05-19
Retrieval-Augmented Generation (RAG)
A plain LLM only knows what was in its training data. RAG injects fresh, private knowledge at query time:
- Your documents are split into chunks and embedded into vectors.
- At question time, the user's query is embedded too.
- The most similar chunks are fetched and pasted into the system prompt.
- The LLM answers grounded on those chunks, citing them.
End result: the model knows your contracts, your wiki, your past tickets — without you having to fine-tune it.
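As an illustration of the "pasted into the system prompt" step, here is a minimal Python sketch of how retrieved chunks could be numbered and assembled so the model can cite them. The `Chunk` class, the `build_system_prompt` function, the instruction wording, and the sample texts are all hypothetical; the actual prompt template used by the service is not shown on this page.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_name: str  # source document the chunk came from (hypothetical field)
    text: str      # chunk content as stored at ingestion time


def build_system_prompt(chunks: list[Chunk]) -> str:
    """Paste retrieved chunks into a system prompt, numbered for citation."""
    header = (
        "Answer only from the numbered sources below. "
        "Cite the sources you use as [1], [2], ... "
        "If the sources do not contain the answer, say so."
    )
    sources = [
        f"[{i}] ({chunk.doc_name})\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    return header + "\n\n" + "\n\n".join(sources)


# Invented sample chunks, for illustration only.
retrieved = [
    Chunk("contract.pdf", "Either party may terminate with 90 days' written notice."),
    Chunk("policy.md", "Notices must be sent by registered mail."),
]
prompt = build_system_prompt(retrieved)
print(prompt)
```

Numbering the sources is what lets the answer point back to a specific document instead of making an unsupported claim.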
What we provide
A two-stage hybrid retrieval pipeline built end-to-end on open components, all hosted in Switzerland:
| Layer | Component | Role |
|---|---|---|
| Document parsing | Docling (IBM, MIT) | Layout-aware extraction: preserves tables as markdown, multi-column reading order, OCR for scanned pages. Falls back to pdftotext if Docling is down. |
| Chunking | Custom (PHP) | 512-token chunks with 64-token overlap, respecting sentence boundaries. |
| Dense embeddings | BGE-M3 (BAAI, Apache 2.0) | Multilingual, 1024-dim, served on dedicated L40S GPU via TEI. |
| Sparse vectors | BM25 (in-house tokenizer) | IT/EN/DE/FR/ES stopword-aware, computed per chunk and per query. Captures exact-term matches (codes, names, acronyms). |
| Vector store | Qdrant 1.18 | Persistent dense + sparse storage. Cosine similarity for dense vectors; dot product with IDF weighting for sparse vectors. |
| Hybrid query | Reciprocal Rank Fusion (RRF) in Qdrant | Combines dense + sparse rankings into one ordering. |
| Reranker | BGE-reranker-v2-m3 (BAAI, Apache 2.0) | Cross-encoder, scores (query, chunk) jointly. Reorders top-30 → top-K (+15-25% recall@5). |
| LLM | Apertus 70B by default, any model from the catalog | Generates the final answer grounded in the retrieved chunks. |
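The chunking row can be made concrete with a small sketch. The production chunker is written in PHP and is not shown here, so the following Python is only an approximation under stated assumptions: whitespace-separated words stand in for real tokens, the sentence splitter is a naive regex, and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole sentences into chunks of at most max_tokens words.

    Each new chunk starts with the trailing sentences of the previous one,
    up to `overlap` words. A single sentence longer than the budget still
    becomes its own (oversized) chunk.
    """
    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences totalling at most `overlap` words.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Small budget so the overlap is visible on a short text.
text = " ".join(f"Sentence number {i} is here." for i in range(10))
chunks = chunk_text(text, max_tokens=12, overlap=5)
for c in chunks[:3]:
    print(c)
```

With these toy settings each chunk holds two sentences and repeats the last sentence of its predecessor, which is the same idea as the 512/64 default at a smaller scale.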
The retrieval pipeline
Ingestion:                 Query:
──────────                 ──────
[PDF/DOCX/MD]              [user question]
      ↓                          ↓
Docling parse              BGE-M3 dense
      ↓                    BM25 sparse
Chunker (512/64)                 ↓
      ↓                    Qdrant hybrid + RRF (top-30)
BGE-M3 dense                     ↓
BM25 sparse                BGE-reranker-v2-m3 (top-5)
      ↓                          ↓
Qdrant upsert              Apertus 70B (or chosen model)
                                 ↓
                           Answer + cited sources
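The fusion step on the query side can be sketched in a few lines. In the pipeline above Qdrant performs RRF internally; the Python below only illustrates the textbook formula, where each chunk scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the conventional value from the RRF literature and is an assumption here, as are the function name and the chunk ids.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 30) -> list[str]:
    """Merge several best-first rankings with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:limit]


dense = ["c7", "c2", "c9", "c4"]   # hypothetical ids, best first (dense search)
sparse = ["c2", "c5", "c7", "c1"]  # hypothetical ids, best first (BM25 search)
fused = rrf_fuse([dense, sparse])
print(fused)
```

Chunks that appear in both lists (`c2`, `c7`) rise to the top even though neither ranking agrees on the first place, which is why rank-based fusion needs no score normalization between the dense and sparse sides.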
When RAG is the right tool
- ✅ Question-answering on a fixed corpus (contracts, manuals, policies, wikis).
- ✅ Internal search with semantic understanding (find what you mean, not just keywords).
- ✅ Compliance: the model can only quote what you gave it; less freedom to hallucinate.
- ❌ Real-time data feeds (use tool calling instead).
- ❌ Modifying the model's "personality" or output style (fine-tuning is the answer).
- ❌ Tiny corpus (< 10 docs): the LLM context window probably fits everything already.
How to use it
Two options:
Through the dashboard (no code)
my.siati.ai/dashboard/rag: create a Knowledge Base, drag-and-drop PDFs/DOCX/MD/TXT, ask questions. The answer cites the source documents with anchors.
Through the API
See API: RAG for the four endpoints (create KB, upload doc, list docs, chat).
# Create a knowledge base
curl https://my.siati.ai/api/v1/rag/kb \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"name": "Contracts 2026"}'

# Upload a PDF (returns 202, indexes async)
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/docs \
  -H "Authorization: Bearer $SIATI_JWT" \
  -F file=@contract.pdf

# Chat
curl https://my.siati.ai/api/v1/rag/kb/contracts-2026-abc123/chat \
  -H "Authorization: Bearer $SIATI_JWT" \
  -H "Content-Type: application/json" \
  -d '{"question": "Termine di preavviso?", "model": "apertus-70b-instruct"}'
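For callers who prefer Python over curl, the create and chat calls could be built with the standard library as sketched below. The URLs, headers, and JSON fields are copied from the curl examples above; the helper names are hypothetical, and because the response format is not documented on this page, the sketch returns raw bytes and does not parse them. Nothing is sent until `send` is called.

```python
import json
import os
import urllib.request

BASE_URL = "https://my.siati.ai/api/v1/rag"


def _json_request(path: str, payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_kb_request(name: str, token: str) -> urllib.request.Request:
    return _json_request("/kb", {"name": name}, token)


def chat_request(kb_id: str, question: str, token: str,
                 model: str = "apertus-70b-instruct") -> urllib.request.Request:
    return _json_request(f"/kb/{kb_id}/chat",
                         {"question": question, "model": model}, token)


def send(request: urllib.request.Request) -> bytes:
    """Perform the HTTP call and return the raw response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


token = os.environ.get("SIATI_JWT", "")
req = chat_request("contracts-2026-abc123", "Termine di preavviso?", token)
print(req.get_method(), req.full_url)
# body = send(req)  # uncomment to perform the call; requires a valid token
```

The document upload is left out because multipart form encoding takes noticeably more code with the standard library alone.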
Defaults that work
We picked sensible defaults so RAG works out of the box:
- Chunk size: 512 tokens with 64 overlap. Good for most prose.
- Top-K: 5 chunks returned to the LLM, oversampled from 30 candidates before reranking.
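- Scoring: cosine similarity for dense vectors, IDF-weighted dot product for sparse vectors, RRF for fusion.
- System prompt template: the model is instructed to answer only from the retrieved context and to cite its sources. This sharply limits answers drawn from training data, though no prompt can rule out hallucination entirely.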
- PDF max size: 50 MB. Indexing of a 500-page PDF: about 60–90 seconds end-to-end.
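To get a feel for what these defaults imply, the arithmetic can be written down directly. The sketch below is a rough estimate that ignores sentence-boundary adjustments: after the first 512-token chunk, every further chunk adds 512 - 64 = 448 new tokens. The function names and the 250,000-token example document are assumptions made for illustration.

```python
import math

CHUNK_TOKENS = 512
OVERLAP_TOKENS = 64
TOP_K = 5


def estimate_chunks(doc_tokens: int) -> int:
    """Approximate chunk count for a document of doc_tokens tokens."""
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= CHUNK_TOKENS:
        return 1
    stride = CHUNK_TOKENS - OVERLAP_TOKENS  # 448 new tokens per extra chunk
    return 1 + math.ceil((doc_tokens - CHUNK_TOKENS) / stride)


def max_context_tokens() -> int:
    """Upper bound on retrieved text handed to the LLM per question."""
    return TOP_K * CHUNK_TOKENS


print(estimate_chunks(250_000))  # 558 chunks for a hypothetical 250k-token document
print(max_context_tokens())      # 2560 tokens of retrieved context at most
```

The second number is why the defaults suit most models: five chunks take at most 2,560 tokens of context, leaving room for the question and the answer.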
Why hybrid + reranker, not just dense
Pure dense retrieval (only BGE-M3) misses two common scenarios:
- Exact-term queries: a user looks for "Articolo 1571 c.c." or "SKU 47B-A12". The dense embedding tends to dilute these into a general "legal article" or "product code" concept. BM25 sparse fixes this — exact tokens get high weight.
- Semantically close but wrong candidates: the top-5 by cosine similarity sometimes contains chunks that share the topic of the question without answering it. A cross-encoder reads each candidate jointly with the query and re-ranks more accurately, a measurable +15–25% recall@5 on standard RAG benchmarks.
We do both, at the cost of ~150 ms extra per query (reranking 30 chunks on an L40S is fast) and about 30% more storage in the vector store (sparse vectors are small compared to dense ones).
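The exact-term effect is easy to see with the IDF part of BM25. The sketch below is a simplification: it scores a chunk by summing the IDF of the query terms it contains and leaves out BM25's term-frequency saturation and length normalization. The tokenizer is a stand-in for the in-house one, and the four chunks are invented toy text.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer: lowercase alphanumeric runs, no stopword handling.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf(term: str, docs: list[set[str]]) -> float:
    """BM25-style inverse document frequency: rare terms weigh more."""
    n = sum(1 for d in docs if term in d)  # chunks containing the term
    total = len(docs)
    return math.log(1 + (total - n + 0.5) / (n + 0.5))


def score(query: str, doc: set[str], docs: list[set[str]]) -> float:
    # Simplified score: sum of IDF over query terms present in the chunk.
    return sum(idf(t, docs) for t in set(tokenize(query)) if t in doc)


chunks = [
    "Article 1571 covers the lease definition.",
    "Article 1572 covers long leases.",
    "Article 1573 covers the maximum lease term.",
    "Article 1574 covers open-ended leases.",
]
docs = [set(tokenize(c)) for c in chunks]

idf_common = idf("article", docs)  # appears in every chunk
idf_rare = idf("1571", docs)       # appears in one chunk only
scores = [score("article 1571", d, docs) for d in docs]
best = scores.index(max(scores))
print(round(idf_common, 3), round(idf_rare, 3), best)
```

The token `1571` carries more than ten times the weight of `article`, so the chunk with the exact article number wins clearly, whereas a dense embedding would place all four chunks close together.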
Compared to OpenAI Assistants
| | siati RAG | OpenAI Assistants |
|---|---|---|
| Data location | Switzerland | US (Azure) |
| Vector store | Qdrant (we run it) | OpenAI-managed |
| Embeddings | BGE-M3 open-weight | OpenAI proprietary |
| Document limit | Tied to your storage plan | 20 files / Assistant |
| Audit trail | Full (we give it to you) | Limited |
| File deletion | Hard delete from our store | Soft delete |
| Cost model | Token-based + storage | Subscription + tokens |