siati.ai docs

Cookbook

Tuning RAG quality

Knobs that move the needle on retrieval quality, in order of impact.

Last updated: 2026-05-20

We expose four knobs per request (top_k, candidates, model, tier) and a few per-KB defaults. This is the playbook for using them.

Per-request knobs

```json
POST /api/v1/rag/kb/{slug}/chat
{
  "question": "...",
  "model":    "apertus-70b-instruct",
  "tier":     "medium",
  "top_k":    5,
  "candidates": 30
}
```

| Knob | Default | When to change |
| --- | --- | --- |
| top_k | 5 | Increase to 10–15 for broad / exploratory questions where multiple sources matter. Keep at 3–5 for precise factual lookups. |
| candidates | 30 | Number of chunks pulled from Qdrant before reranking. Higher = better recall but slower. 50 for tough KBs, 20 for fast snappy KBs. |
| model | apertus-70b-instruct | Switch to deepseek-r1-distill-70b for reasoning-heavy tasks with visible chain-of-thought (legal interpretation, math, debug), or mistral-large-2 for top general quality. Use qwen2.5:7b-instruct-q4_K_M for cheap drafts. |
| tier | medium | Bump to fast or ludicrous for user-facing latency-sensitive paths. |
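
As a rough illustration of the table above, the following Python sketch builds the request body for two kinds of question. The preset names and the `build_chat_request` and `chat_path` helpers are hypothetical, and raising `candidates` alongside `top_k` for exploratory questions is an assumption rather than a documented recommendation; only the field names, defaults, and endpoint path come from this page.

```python
import json

# Illustrative presets (assumption): values picked from the ranges in the table.
PRESETS = {
    "precise": {"top_k": 5, "candidates": 30},       # the documented defaults
    "exploratory": {"top_k": 12, "candidates": 50},  # broader recall, slower
}


def chat_path(slug):
    """Return the chat endpoint path for a knowledge base slug."""
    return f"/api/v1/rag/kb/{slug}/chat"


def build_chat_request(question, kind="precise",
                       model="apertus-70b-instruct", tier="medium"):
    """Build the JSON-serialisable body for a RAG chat request."""
    if kind not in PRESETS:
        raise ValueError(f"unknown preset: {kind!r}")
    return {"question": question, "model": model, "tier": tier, **PRESETS[kind]}


if __name__ == "__main__":
    body = build_chat_request("Which contracts mention a cancellation fee?",
                              kind="exploratory")
    print("POST", chat_path("contracts"))
    print(json.dumps(body, indent=2))
```

Sending the body is left to whatever HTTP client and authentication scheme you already use; neither is specified on this page.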

Per-KB defaults (set at creation)

| Field | Default | Notes |
| --- | --- | --- |
| chunk_size_tokens | 512 | Smaller (256) for fact-dense corpora (catalogues, datasheets). Larger (1024) for narrative prose. |
| chunk_overlap_tokens | 64 | ~12% of chunk size is a good baseline. Higher overlap helps when answers span chunk boundaries. |
| embedding_model | bge-m3 | Currently the only multilingual model we serve at this embedding dimension. |
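
To make the interaction between chunk size and overlap concrete, here is a minimal sliding-window sketch in Python. It assumes a plain fixed-size window over token positions; the chunker the service actually runs may split on structure or sentences instead, and the `chunk_windows` helper is hypothetical.

```python
def chunk_windows(n_tokens, chunk_size=512, overlap=64):
    """Return (start, end) token offsets for fixed-size overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:  # last chunk reached the end of the document
            break
        start += step
    return windows


if __name__ == "__main__":
    # A 1000-token document with the default settings yields three chunks;
    # each neighbouring pair shares 64 tokens.
    print(chunk_windows(1000))
    print(f"overlap ratio: {64 / 512:.1%}")
```

With the defaults, each chunk advances 448 tokens, so an answer that straddles a boundary is fully contained in one chunk as long as it is shorter than the 64-token overlap.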

Anti-patterns

Reading the response

Every RAG chat response includes a sources array:

```json
{
  "answer": "...",
  "sources": [
    {
      "score":             0.9824,      // reranker score 0..1 (sigmoid)
      "dense_score":       0.661,        // original RRF score from hybrid search
      "text":              "...",
      "document_filename": "contract.pdf",
      "chunk_idx":         12
    }
  ],
  "prompt_tokens":     1241,
  "completion_tokens": 178
}
```

  • A score close to 1 means the reranker is very confident the chunk answers the question. Below 0.3, the chunk is a stretch.
  • dense_score is the pre-rerank fusion score; it is useful for debugging why a chunk made it through.
  • A big gap between dense_score and score means the reranker disagreed with the pre-rerank ranking, which is usually a good thing.
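
The rules of thumb above can be turned into a small review helper. This is a sketch only: the `review_sources` function and its flag names are hypothetical, the 0.3 confidence floor comes from the text, and the 0.3 gap threshold is an arbitrary assumption you would tune per KB.

```python
def review_sources(sources, low=0.3, gap=0.3):
    """Flag sources that look weak or where the reranker changed the picture."""
    notes = []
    for src in sources:
        flags = []
        if src["score"] < low:
            flags.append("low_confidence")      # reranker thinks it is a stretch
        if abs(src["score"] - src["dense_score"]) >= gap:
            flags.append("reranker_disagrees")  # large pre/post-rerank gap
        notes.append({
            "document_filename": src["document_filename"],
            "chunk_idx": src["chunk_idx"],
            "flags": flags,
        })
    return notes


if __name__ == "__main__":
    response = {
        "sources": [
            {"score": 0.9824, "dense_score": 0.661, "text": "...",
             "document_filename": "contract.pdf", "chunk_idx": 12},
        ]
    }
    for note in review_sources(response["sources"]):
        print(note)
```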

When the answer is wrong

The pipeline can fail in three places:

  1. Parsing: open the document chunks (admin → RAG → KB → docs) and search for the expected text. If not present, Docling missed it. Re-upload after manual extraction, or try a different format.
  2. Retrieval: use the same chunk inspection view and search for the answer's source manually. If it is present but never returned, increase candidates or top_k; consider rephrasing the query with keywords the document actually uses.
  3. Generation: chunks are correct, model still answers wrong → switch to a stronger model (mistral-large-2, qwen2.5-72b-instruct, or deepseek-r1-distill-70b).
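
The three checks above amount to a short decision procedure, sketched below in Python. The `diagnose` helper is hypothetical; it assumes you have copied the KB's chunk texts out of the admin view into a list and that a case-insensitive substring match is good enough to spot the expected passage. It is only meaningful once you already know the answer was wrong.

```python
def diagnose(expected_text, kb_chunks, returned_sources):
    """Guess which stage failed for a question that was answered wrongly.

    expected_text:    a phrase the correct answer should be based on
    kb_chunks:        chunk texts stored in the KB (list of str)
    returned_sources: the "sources" array from the chat response
    """
    needle = expected_text.casefold()
    if not any(needle in chunk.casefold() for chunk in kb_chunks):
        return "parsing"      # the text never made it into the KB
    if not any(needle in src["text"].casefold() for src in returned_sources):
        return "retrieval"    # stored, but not among the returned chunks
    return "generation"       # the model saw it and still answered wrongly


if __name__ == "__main__":
    kb = ["The cancellation fee is 500 CHF.", "Payment is due in 30 days."]
    sources = [{"text": "Payment is due in 30 days."}]
    print(diagnose("cancellation fee", kb, sources))
```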

Hybrid in practice

The query "What's the cancellation fee in contract Acme?" benefits from:

  • Dense (BGE-M3) to find chunks about "cancellation", "termination", "rescission".
  • Sparse (BM25) to find chunks containing the exact literal "Acme" — even if it's not semantically close to the rest of the query.

RRF fusion combines both rankings without you having to do anything. If you're querying a corpus with lots of proper nouns / codes / dates, hybrid earns its keep. On pure prose corpora the impact is smaller.
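
For readers who want to see what the fusion step does, here is a minimal sketch of reciprocal rank fusion in Python. It uses the commonly cited constant k = 60 and invented chunk IDs; the constant the service uses, and any normalisation it applies to the fused score afterwards, are not documented on this page, so treat both as assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    dense = ["c7", "c2", "c9"]   # semantic matches for "cancellation fee"
    sparse = ["c2", "c4", "c7"]  # literal matches for "Acme"
    print(rrf_fuse([dense, sparse]))
```

In this toy example the chunks that appear in both lists (c2 and c7) rise above chunks that only one retriever found, which is the behaviour that helps on queries mixing a concept with a proper noun.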

Re-embedding a KB

If you change the chunker config or want to re-process all documents (e.g. after a Docling upgrade), drop and recreate the KB. We don't yet have a migration tool that keeps the same KB ID — coming Q2.