Cookbook — RAG over your PDFs#

End-to-end example: index a local corpus with bge-m3, then chat over the content with siati/llama-3.1-405b. No external vector DB is needed for small corpora (< 10K chunks).

Requirements#

pip install openai numpy

Full script (Python)#

"""
Minimal RAG: bge-m3 + Llama 405B, no vector DB.
For corpora < 10K chunks, in-memory cosine search is typically faster, cheaper,
and less opinionated than an external vector DB.
"""
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="https://api.siati.ai/v1", api_key="siati_...")

# 1. Your corpus (replace with your own PDFs/DOCX/MD split into chunks)
corpus = [
    "Swiss nFADP classifies health data as sensitive (art. 5).",
    "The US CLOUD Act applies to all US providers, even with EU DCs.",
    "FINMA Circ. 2018/3 governs IT outsourcing for Swiss banks.",
    "Schrems II invalidated the EU-US Privacy Shield (CJEU C-311/18).",
    # ... thousands of chunks
]

# 2. Index once (persist the result to disk/DB if needed)
print("Indexing corpus...")
vecs = np.array([
    e.embedding
    for e in client.embeddings.create(model="siati/bge-m3", input=corpus).data
])
print(f"  → {len(corpus)} chunks, embeddings shape = {vecs.shape}")

def retrieve(question: str, k: int = 3) -> list[str]:
    """Top-k chunks most similar to the question (cosine similarity)."""
    q = np.array(
        client.embeddings.create(model="siati/bge-m3", input=[question])
        .data[0].embedding
    )
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    top_idx = sims.argsort()[-k:][::-1]
    return [corpus[i] for i in top_idx]

def chat(question: str) -> str:
    """RAG: retrieve top-3 chunks, pass to 405B as context."""
    chunks = retrieve(question, k=3)
    context = "\n\n".join(f"- {c}" for c in chunks)
    resp = client.chat.completions.create(
        model="siati/llama-3.1-405b",
        messages=[
            {"role": "system", "content":
             "Answer the question using ONLY the information in the context. "
             "If the answer is not in the context, say so clearly."},
            {"role": "user", "content":
             f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

# 3. Use it
print(chat("Can a Swiss bank use AWS Bedrock for client data?"))

Expected output#

Indexing corpus...
  → 4 chunks, embeddings shape = (4, 1024)
It is not advisable. AWS Bedrock is subject to the US CLOUD Act (even with
EU DCs), and FINMA Circ. 2018/3 on IT outsourcing requires controls that
US-jurisdiction services do not easily satisfy for regulated client data...
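
The corpus above is hard-coded for the example. To index real PDFs, extract the text and split it into chunks before indexing. A minimal sketch, assuming pypdf is installed (pip install pypdf) and that a fixed-size character window with overlap is good enough; the ./docs folder and window sizes are illustrative:

from pathlib import Path
from pypdf import PdfReader

def pdf_to_chunks(path: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Extract text from one PDF and split it into overlapping character windows."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step) if text[i:i + size].strip()]

# Build the corpus from every PDF in a local folder, then index it as in step 2
corpus = [
    chunk
    for pdf in Path("./docs").glob("*.pdf")
    for chunk in pdf_to_chunks(str(pdf))
]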

Why this works so well#

  • No vector DB for small bases: 10K chunks × 1024 dim × 4 bytes (float32) = 40 MB in RAM. Cosine sim is a single numpy GEMM, sub-millisecond (see the sketch after this list).
  • bge-m3 multilingual: index text in IT/DE/FR/EN in the same vector space. A query in English matches German chunks if semantically close.
  • Llama 405B as reasoner: with 3-5 relevant chunks in context, response quality is frontier-class.
  • Zero data egress: chunks and queries never leave Switzerland.
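
A design note on the first bullet: normalize the embeddings once at indexing time and cast them to float32, and every query reduces to a single matrix-vector product over the whole corpus. A minimal sketch reusing client, corpus, vecs and np from the script above:

# Normalize once at indexing time; cosine similarity then becomes a plain dot product
vecs32 = vecs.astype(np.float32)
vecs_unit = vecs32 / np.linalg.norm(vecs32, axis=1, keepdims=True)

def retrieve_fast(question: str, k: int = 3) -> list[str]:
    """Same top-k retrieval as retrieve(), with the normalization pre-computed."""
    q = np.array(
        client.embeddings.create(model="siati/bge-m3", input=[question]).data[0].embedding,
        dtype=np.float32,
    )
    q /= np.linalg.norm(q)
    sims = vecs_unit @ q                       # one GEMV over the whole corpus
    top_idx = np.argpartition(sims, -k)[-k:]   # top-k, unordered
    return [corpus[i] for i in top_idx[np.argsort(sims[top_idx])[::-1]]]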

When to move to a real vector DB#

  • More than 100K chunks (in-memory becomes heavy)
  • High concurrent requests (single Python thread isn't enough)
  • Complex metadata filters (e.g. "search only in contracts signed in 2025")

Then consider Qdrant, Weaviate, or Milvus, all self-hostable in Switzerland, so your data stays local.
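
Migrating keeps the same embeddings; you mostly move them into a collection and let the DB handle filtering and concurrency. A minimal sketch with a self-hosted Qdrant, assuming qdrant-client is installed and a local instance on the default port; the collection name and payload field are illustrative, and vecs, corpus and client come from the script above:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="pdf_chunks",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

# Move the existing embeddings over, keeping the chunk text as payload
qdrant.upsert(
    collection_name="pdf_chunks",
    points=[
        PointStruct(id=i, vector=vecs[i].tolist(), payload={"text": corpus[i]})
        for i in range(len(corpus))
    ],
)

# Query: embed the question exactly as in retrieve(), then search Qdrant
question = "Can a Swiss bank use AWS Bedrock for client data?"
q = client.embeddings.create(model="siati/bge-m3", input=[question]).data[0].embedding
hits = qdrant.search(collection_name="pdf_chunks", query_vector=q, limit=3)
chunks = [hit.payload["text"] for hit in hits]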

Useful variations#

  • Hybrid search (BM25 + dense): add rank_bm25 for better recall on queries with rare terms (medical codes, IBANs, ATC codes, etc.); a sketch follows this list
  • Re-ranking: after top-20 retrieve, run a cross-encoder for top-5 (better quality, ~50ms extra)
  • Hierarchical chunking: split by paragraph + index a summary of each section, navigate hierarchically for long documents
  • Citation tracking: ask the model to cite chunk IDs used, make them clickable in the UI
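
A minimal sketch of the hybrid-search variation, assuming rank_bm25 is installed (pip install rank_bm25) and reusing client, corpus, vecs and np from the script above; the 0.5 fusion weight and whitespace tokenization are illustrative:

from rank_bm25 import BM25Okapi

# Lexical index: naive whitespace tokenization is often enough for codes and IDs
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def hybrid_retrieve(question: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    """Blend dense cosine scores with BM25 scores (both min-max normalized)."""
    q = np.array(
        client.embeddings.create(model="siati/bge-m3", input=[question]).data[0].embedding
    )
    dense = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    lexical = np.asarray(bm25.get_scores(question.lower().split()))

    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    scores = alpha * norm(dense) + (1 - alpha) * norm(lexical)
    return [corpus[i] for i in scores.argsort()[-k:][::-1]]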

LangChain integration#

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

emb = OpenAIEmbeddings(
    base_url="https://api.siati.ai/v1",
    api_key="siati_...",
    model="siati/bge-m3",
)
llm = ChatOpenAI(
    base_url="https://api.siati.ai/v1",
    api_key="siati_...",
    model="siati/llama-3.1-405b",
)

vectorstore = FAISS.from_texts(corpus, emb)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# ... use the retriever in a RetrievalQA chain
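
One way to finish the wiring is a small LCEL pipeline; a minimal sketch that reuses retriever and llm from above and mirrors the system prompt of the full script (the exact chain style is a matter of taste, RetrievalQA works too):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer the question using ONLY the information in the context. "
    "If the answer is not in the context, say so clearly.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(f"- {d.page_content}" for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("Can a Swiss bank use AWS Bedrock for client data?"))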