Tuning RAG quality
Knobs that move the needle on retrieval quality, in order of impact.
Last updated: 2026-05-20
We expose four knobs per request and a few per-KB defaults. This is the playbook for using them.
Per-request knobs
POST /api/v1/rag/kb/{slug}/chat
{
  "question": "...",
  "model": "apertus-70b-instruct",
  "tier": "medium",
  "top_k": 5,
  "candidates": 30
}
| Knob | Default | When to change |
|---|---|---|
| `top_k` | 5 | Increase to 10–15 for broad / exploratory questions where multiple sources matter. Keep at 3–5 for precise factual lookups. |
| `candidates` | 30 | Number of chunks pulled from Qdrant before reranking. Higher = better recall but slower. Use 50 for tough KBs, 20 for fast, snappy KBs. |
| `model` | `apertus-70b-instruct` | Switch to `deepseek-r1-distill-70b` for reasoning-heavy tasks with visible chain-of-thought (legal interpretation, math, debug), or `mistral-large-2` for top general quality. Use `qwen2.5:7b-instruct-q4_K_M` for cheap drafts. |
| `tier` | `medium` | Bump to `fast` or `ludicrous` for user-facing latency-sensitive paths. |
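As a rough illustration of the table above, the sketch below builds a request body from a small set of presets. The preset names (`precise`, `broad`, `fast`) and the helper `build_chat_request` are hypothetical, not part of the API; only the field names and values come from this page. The check that `top_k` does not exceed `candidates` is an assumption based on the reranker selecting from the candidate pool.

```python
import json

# Hypothetical presets; the numbers follow the guidance in the table above.
PRESETS = {
    "precise": {"top_k": 5, "candidates": 30},   # factual lookups (defaults)
    "broad": {"top_k": 12, "candidates": 50},    # exploratory questions, tough KBs
    "fast": {"top_k": 3, "candidates": 20},      # latency-sensitive paths
}


def build_chat_request(question, preset="precise",
                       model="apertus-70b-instruct", tier="medium"):
    """Return a JSON-serialisable body for POST /api/v1/rag/kb/{slug}/chat."""
    knobs = PRESETS[preset]
    # Assumption: the reranker picks top_k chunks out of the candidates,
    # so asking for more than were retrieved makes no sense.
    if knobs["top_k"] > knobs["candidates"]:
        raise ValueError("top_k must not exceed candidates")
    return {"question": question, "model": model, "tier": tier, **knobs}


body = build_chat_request("Which clauses mention liability?", preset="broad")
print(json.dumps(body, indent=2))
```

Keeping the presets in one place makes it easier to compare answers across settings when you tune a KB.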
Per-KB defaults (set at creation)
| Field | Default | Notes |
|---|---|---|
| `chunk_size_tokens` | 512 | Smaller (256) for fact-dense corpora (catalogues, datasheets). Larger (1024) for narrative prose. |
| `chunk_overlap_tokens` | 64 | ~12% of chunk size is a good baseline. Higher overlap helps when answers span chunk boundaries. |
| `embedding_model` | `bge-m3` | Currently the only multilingual model we serve at this embedding dimension. |
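To get a feel for how chunk size and overlap trade off, the sketch below estimates how many chunks a document produces. It assumes a plain sliding-window chunker (fixed size, fixed overlap); the real chunker may split on document structure instead, so treat the numbers as rough estimates. The function name `chunk_count` is hypothetical.

```python
import math


def chunk_count(doc_tokens, chunk_size=512, overlap=64):
    """Estimate the number of chunks for a sliding-window chunker (assumed)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens covered by each extra chunk
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)


# Overlap ratio for the defaults: 64 / 512 = 0.125
print(64 / 512)

# A 10,000-token document at the three chunk sizes mentioned above,
# each with roughly 12% overlap.
for size, overlap in [(256, 32), (512, 64), (1024, 128)]:
    print(size, overlap, chunk_count(10_000, size, overlap))
```

Under these assumptions, halving the chunk size roughly doubles the number of chunks to embed and store, which is worth knowing before you recreate a large KB.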
Anti-patterns
Reading the response
Every RAG chat response includes a sources array:
{
  "answer": "...",
  "sources": [
    {
      "score": 0.9824,        // reranker score 0..1 (sigmoid)
      "dense_score": 0.661,   // original RRF score from hybrid search
      "text": "...",
      "document_filename": "contract.pdf",
      "chunk_idx": 12
    }
  ],
  "prompt_tokens": 1241,
  "completion_tokens": 178
}
- `score` close to 1 = reranker is very confident the chunk answers the question. Below 0.3 = stretching.
- `dense_score` is the pre-rerank fusion score; useful for debugging "why did this chunk make it through".
- Big gap between `dense_score` and `score` = the reranker disagreed with the dense embedding's pick, which is usually a good thing.
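The reading rules above can be turned into a small checker, sketched here. The 0.3 cut-off for a weak `score` comes from the text; the 0.3 threshold for a "big gap" is an assumption chosen for illustration, and the helper name `annotate_sources` is hypothetical. The two scores are on different scales, so treat the gap flag as a debugging hint rather than a measurement.

```python
def annotate_sources(sources, low=0.3, gap=0.3):
    """Flag weak chunks and chunks where the reranker and fusion score diverge."""
    notes = []
    for src in sources:
        flags = []
        if src["score"] < low:
            flags.append("weak")  # reranker thinks the chunk is a stretch
        if abs(src["score"] - src["dense_score"]) >= gap:
            flags.append("reranker-disagrees")  # gap threshold is an assumption
        notes.append((src["document_filename"], src["chunk_idx"], flags))
    return notes


# Sample data: the first entry mirrors the response above, the second is made up.
sample_sources = [
    {"score": 0.9824, "dense_score": 0.661, "text": "...",
     "document_filename": "contract.pdf", "chunk_idx": 12},
    {"score": 0.21, "dense_score": 0.58, "text": "...",
     "document_filename": "contract.pdf", "chunk_idx": 40},
]

for note in annotate_sources(sample_sources):
    print(note)
```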
When the answer is wrong
The pipeline can fail in three places:
- Parsing: open the document chunks (admin → RAG → KB → docs) and search for the expected text. If it is not there, Docling missed it. Re-upload after manual extraction, or try a different format.
- Retrieval: inspect the same chunks and search for the answer's source manually. If it is present but never returned, increase `candidates` or `top_k`; consider rephrasing the query with keywords the document actually uses.
- Generation: the chunks are correct but the model still answers wrong → switch to a stronger model (`mistral-large-2`, `qwen2.5-72b-instruct`, or `deepseek-r1-distill-70b`).
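The three checks above amount to a simple decision procedure, sketched below. It assumes you have the KB's chunk texts available as a list of strings (for example copied from the admin view; there is no export call shown on this page) and the `sources` array from the chat response. The function name `triage` is hypothetical, and the substring match is a deliberately naive stand-in for a manual search.

```python
def triage(expected_text, kb_chunks, returned_sources):
    """Guess which pipeline stage failed: 'parsing', 'retrieval' or 'generation'."""
    needle = expected_text.lower()
    # 1. Parsing: the expected text never made it into any chunk.
    if not any(needle in chunk.lower() for chunk in kb_chunks):
        return "parsing"
    # 2. Retrieval: the text is in the KB but was not among the returned sources.
    if not any(needle in src["text"].lower() for src in returned_sources):
        return "retrieval"
    # 3. Generation: the right chunk was returned and the answer is still wrong.
    return "generation"


kb_chunks = [
    "The cancellation fee is 15% of the annual contract value.",
    "Either party may terminate with 90 days notice.",
]
returned = [{"text": "Either party may terminate with 90 days notice."}]

print(triage("cancellation fee is 15%", kb_chunks, returned))
```

In this made-up example the expected sentence exists in the KB but was not returned, so the sketch points at retrieval: raise `candidates` or `top_k`, or rephrase the query.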
Hybrid in practice
The query "What's the cancellation fee in contract Acme?" benefits from:
- Dense (BGE-M3) to find chunks about "cancellation", "termination", "rescission".
- Sparse (BM25) to find chunks containing the exact literal "Acme" — even if it's not semantically close to the rest of the query.
RRF fusion combines both rankings without you having to do anything. If you're querying a corpus with lots of proper nouns / codes / dates, hybrid earns its keep. On pure prose corpora the impact is smaller.
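For intuition, here is a minimal sketch of reciprocal rank fusion. It uses the conventional constant `k = 60`; the value and the exact variant the service uses are not documented here, so both are assumptions. The chunk IDs are made up to mirror the Acme example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Dense search favours chunks about cancellation in general.
dense = ["other-cancellation", "acme-cancellation", "generic-termination"]
# Sparse search favours chunks containing the literal "Acme".
sparse = ["acme-address", "acme-cancellation", "acme-signatures"]

fused = rrf_fuse([dense, sparse])
print(fused)
```

The chunk that appears in both lists rises to the top even though neither retriever ranked it first, which is the behaviour that helps with proper nouns, codes and dates.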
Re-embedding a KB
If you change the chunker config or want to re-process all documents (e.g. after a Docling upgrade), drop and recreate the KB. We don't yet have a migration tool that keeps the same KB ID — coming Q2.