
Tuning RAG quality

Knobs that move the needle on retrieval quality, in order of impact.

Last updated: 2026-05-20

We expose four knobs per request and a few per-KB defaults. This is the playbook for using them.

Per-request knobs

POST /api/v1/rag/kb/{slug}/chat
{
  "question": "...",
  "model":    "apertus-70b-instruct",
  "tier":     "medium",
  "top_k":    5,
  "candidates": 30
}

Knob	Default	When to change
`top_k`	5	Increase to 10–15 for broad / exploratory questions where multiple sources matter. Keep at 3–5 for precise factual lookups.
`candidates`	30	Number of chunks pulled from Qdrant before reranking. Higher values improve recall but add latency. Use 50 for difficult KBs and 20 where latency matters most.
`model`	`apertus-70b-instruct`	Switch to `deepseek-r1-distill-70b` for reasoning-heavy tasks with visible chain-of-thought (legal interpretation, math, debugging), or `mistral-large-2` for top general quality. Use `qwen2.5:7b-instruct-q4_K_M` for cheap drafts.
`tier`	`medium`	Bump to `fast` or `ludicrous` for user-facing latency-sensitive paths.
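
As a minimal sketch of how these knobs fit into a request, the Python below builds the POST body for a broad versus a precise question. The host `rag.example.com` and the helper `build_chat_request` are hypothetical names used for illustration, authentication is not shown, and the "broad" values (12 and 50) are simply picked from the ranges in the table above.

```python
import json
import urllib.request

BASE_URL = "https://rag.example.com"  # hypothetical host, replace with your own


def build_chat_request(slug, question, *, broad=False):
    """Build (but do not send) a chat request for the KB identified by `slug`.

    broad=True widens retrieval for exploratory questions;
    broad=False keeps the documented defaults for precise lookups.
    """
    payload = {
        "question": question,
        "model": "apertus-70b-instruct",
        "tier": "medium",
        "top_k": 12 if broad else 5,        # chunks passed to the model
        "candidates": 50 if broad else 30,  # chunks fetched before reranking
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/rag/kb/{slug}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # auth header omitted
        method="POST",
    )


req = build_chat_request("contracts", "Summarise our supplier obligations", broad=True)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(req)` once the host and credentials are filled in.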

Per-KB defaults (set at creation)

Field	Default	Notes
`chunk_size_tokens`	512	Smaller (256) for fact-dense corpora (catalogues, datasheets). Larger (1024) for narrative prose.
`chunk_overlap_tokens`	64	~12% of chunk size is a good baseline. Higher overlap helps when answers span chunk boundaries.
`embedding_model`	`bge-m3`	Currently the only multilingual model we serve at this embedding dimension.
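
To get a feel for what a chunker config costs, the sketch below estimates the overlap ratio and chunk count for a document. It assumes a plain sliding window with a fixed stride of `chunk_size_tokens - chunk_overlap_tokens`; the actual chunker may split on sentence or layout boundaries, so treat the numbers as rough estimates. `chunk_stats` is a hypothetical helper, not part of the API.

```python
import math


def chunk_stats(doc_tokens, chunk_size=512, overlap=64):
    """Estimate chunk count for a sliding-window chunker (an assumption)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # new tokens contributed by each extra chunk
    if doc_tokens <= 0:
        chunks = 0
    elif doc_tokens <= chunk_size:
        chunks = 1
    else:
        chunks = 1 + math.ceil((doc_tokens - chunk_size) / stride)
    return {
        "stride": stride,
        "overlap_ratio": overlap / chunk_size,
        "chunks": chunks,
    }


print(chunk_stats(10_000))           # defaults: 512 / 64
print(chunk_stats(10_000, 256, 32))  # fact-dense corpus setting
```

Under this assumption, halving the chunk size roughly doubles the number of vectors stored and searched for the same document.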

Anti-patterns

Reading the response

Every RAG chat response includes a sources array:

{
  "answer": "...",
  "sources": [
    {
      "score":             0.9824,      // reranker score 0..1 (sigmoid)
      "dense_score":       0.661,        // original RRF score from hybrid search
      "text":              "...",
      "document_filename": "contract.pdf",
      "chunk_idx":         12
    }
  ],
  "prompt_tokens":     1241,
  "completion_tokens": 178
}

`score` close to 1 = the reranker is very confident the chunk answers the question. Below 0.3, the chunk is a stretch.
`dense_score` is the pre-rerank fusion score; useful for debugging "why did this chunk make it through".
A big gap between `dense_score` and `score` = the reranker disagreed with the pre-rerank ranking, which is usually a good thing.
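
The reading rules above can be sketched as a small helper that flags each source. The 0.3 cut-off for weak chunks comes from the text; the 0.3 gap threshold and the name `review_sources` are assumptions chosen for illustration.

```python
def review_sources(sources, weak_below=0.3, gap_above=0.3):
    """Return (filename, chunk_idx, flags) for each source in a chat response."""
    notes = []
    for src in sources:
        flags = []
        if src["score"] < weak_below:
            flags.append("weak")  # reranker thinks this chunk is a stretch
        if abs(src["score"] - src["dense_score"]) > gap_above:
            flags.append("reranker-disagrees")  # rerank moved it a long way
        notes.append((src["document_filename"], src["chunk_idx"], flags))
    return notes


sources = [
    {"score": 0.9824, "dense_score": 0.661,
     "document_filename": "contract.pdf", "chunk_idx": 12},
    {"score": 0.21, "dense_score": 0.40,
     "document_filename": "contract.pdf", "chunk_idx": 3},
]
for note in review_sources(sources):
    print(note)
```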

When the answer is wrong

The pipeline can fail in three places:

Parsing: open the document chunks (admin → RAG → KB → docs) and search for the expected text. If it is not present, Docling missed it. Re-upload after manual extraction, or try a different format.
Retrieval: use the same chunk view and search for the answer's source manually. If it is present but never returned, increase `candidates` or `top_k`; consider rephrasing the query with keywords the document actually uses.
Generation: the chunks are correct but the model still answers wrong. Switch to a stronger model (`mistral-large-2`, `qwen2.5-72b-instruct`, or `deepseek-r1-distill-70b`).
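
The three checks above amount to a simple decision procedure, sketched below. It assumes you have copied the KB's chunk texts out of the admin view into a list (`kb_chunks`) and have the `sources` array from a chat response; `locate_failure` is a hypothetical helper, and plain substring matching stands in for the manual search.

```python
def locate_failure(expected_text, kb_chunks, returned_sources):
    """Guess which pipeline stage lost the answer: parsing, retrieval or generation."""
    needle = expected_text.casefold()
    # Stage 1: is the text in any stored chunk at all?
    if not any(needle in chunk.casefold() for chunk in kb_chunks):
        return "parsing"
    # Stage 2: did any returned source carry it?
    if not any(needle in src["text"].casefold() for src in returned_sources):
        return "retrieval"
    # Stage 3: the model saw the right chunk and still answered wrong.
    return "generation"


kb_chunks = ["Termination: a cancellation fee of 500 EUR applies.", "Payment terms: 30 days."]
returned = [{"text": "Payment terms: 30 days."}]
print(locate_failure("cancellation fee of 500 EUR", kb_chunks, returned))
```

Exact substring matching is strict; if parsing changed whitespace or hyphenation, search for a shorter distinctive phrase instead.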

Hybrid in practice

The query "What's the cancellation fee in contract Acme?" benefits from:

Dense (BGE-M3) to find chunks about "cancellation", "termination", "rescission".
Sparse (BM25) to find chunks containing the exact literal "Acme" — even if it's not semantically close to the rest of the query.

RRF fusion combines both rankings without you having to do anything. If you're querying a corpus with lots of proper nouns / codes / dates, hybrid earns its keep. On pure prose corpora the impact is smaller.
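
For intuition, here is a minimal sketch of reciprocal rank fusion over two ranked lists of chunk IDs. The constant `k=60` is the value commonly used in the RRF literature and is an assumption here; the service's actual constant and implementation are not documented on this page, and `rrf_fuse` is a hypothetical name.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs: each list adds 1 / (k + rank) per chunk."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_ranking = ["c7", "c2", "c9"]   # semantic matches for "cancellation fee"
sparse_ranking = ["c4", "c2", "c7"]  # literal matches for "Acme"
print(rrf_fuse([dense_ranking, sparse_ranking]))
```

Chunks that appear in both lists (`c7`, `c2`) rise above chunks that only one retriever found, which is why a chunk that mentions both the fee and "Acme" tends to win.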

Re-embedding a KB

If you change the chunker config or want to re-process all documents (e.g. after a Docling upgrade), drop and recreate the KB. We don't yet have a migration tool that keeps the same KB ID — coming Q2.
