Cookbook — RAG over your PDFs#
End-to-end example: index a local corpus with bge-m3 and chat over it
with siati/llama-3.1-405b. No external vector DB is needed for small corpora
(< 10K chunks).
Requirements#
pip install openai numpy
Full script (Python)#
"""
Minimal RAG: bge-m3 + Llama 405B, no vector DB.
For corpora < 10K chunks, in-memory cosine search beats any external DB
(faster, cheaper, less opinionated).
"""
import numpy as np
from openai import OpenAI
client = OpenAI(base_url="https://api.siati.ai/v1", api_key="siati_...")
# 1. Your corpus (replace with your own PDFs/DOCX/MD split into chunks)
corpus = [
"Swiss nFADP classifies health data as sensitive (art. 5).",
"The US CLOUD Act applies to all US providers, even with EU DCs.",
"FINMA Circ. 2018/3 governs IT outsourcing for Swiss banks.",
"Schrems II invalidated the EU-US Privacy Shield (CJEU C-311/18).",
# ... thousands of chunks
]
# 2. Index once (persist the result to disk/DB if needed)
print("Indexing corpus...")
vecs = np.array([
    e.embedding
    for e in client.embeddings.create(model="siati/bge-m3", input=corpus).data
])
print(f" → {len(corpus)} chunks, embeddings shape = {vecs.shape}")
def retrieve(question: str, k: int = 3) -> list[str]:
"""Top-k chunks most similar to the question (cosine similarity)."""
q = np.array(
client.embeddings.create(model="siati/bge-m3", input=[question])
.data[0].embedding
)
sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
top_idx = sims.argsort()[-k:][::-1]
return [corpus[i] for i in top_idx]
def chat(question: str) -> str:
"""RAG: retrieve top-3 chunks, pass to 405B as context."""
chunks = retrieve(question, k=3)
context = "\n\n".join(f"- {c}" for c in chunks)
resp = client.chat.completions.create(
model="siati/llama-3.1-405b",
messages=[
{"role": "system", "content":
"Answer the question using ONLY the information in the context. "
"If the answer is not in the context, say so clearly."},
{"role": "user", "content":
f"Context:\n{context}\n\nQuestion: {question}"},
],
)
return resp.choices[0].message.content
# 3. Use it
print(chat("Can a Swiss bank use AWS Bedrock for client data?"))
Expected output#
Indexing corpus...
→ 4 chunks, embeddings shape = (4, 1024)
It is not advisable. AWS Bedrock is subject to the US CLOUD Act (even with
EU DCs), and FINMA Circ. 2018/3 on IT outsourcing requires controls that
US-jurisdiction services do not easily satisfy for regulated client data...
Why this works so well#
- No vector DB for small corpora: 10K chunks × 1024 dim × 4 bytes = 40 MB in RAM. Cosine similarity is a single NumPy matrix-vector product, sub-millisecond at this scale (see the timing sketch after this list).
- bge-m3 multilingual: index text in IT/DE/FR/EN in the same vector space. A query in English matches German chunks if semantically close.
- Llama 405B as reasoner: with 3-5 relevant chunks in context, response quality is frontier-class.
- Zero data egress: chunks and queries never leave Switzerland.
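To sanity-check the first point, here is a minimal, self-contained timing sketch. Random vectors stand in for real embeddings; the sizes match the bullet above.

import time
import numpy as np

n_chunks, dim = 10_000, 1024
vecs = np.random.rand(n_chunks, dim).astype(np.float32)   # ~40 MB in RAM
q = np.random.rand(dim).astype(np.float32)

t0 = time.perf_counter()
sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
print(f"cosine search over {n_chunks:,} chunks: {(time.perf_counter() - t0) * 1e3:.2f} ms")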
When to move to a real vector DB#
- More than 100K chunks (in-memory becomes heavy)
- Many concurrent requests (a single Python thread won't keep up)
- Complex metadata filters (e.g. "search only in contracts signed in 2025")
Then consider Qdrant, Weaviate, or Milvus, all self-hostable in Switzerland, so your data stays local.
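If you do cross those thresholds, migrating is mostly a matter of moving vecs and corpus into the store. A minimal sketch with Qdrant, assuming pip install qdrant-client and a local instance on the default port; the collection name and example query are illustrative, and client, corpus, and vecs come from the full script above.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qc = QdrantClient(url="http://localhost:6333")
qc.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
qc.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vecs[i].tolist(), payload={"text": corpus[i]})
        for i in range(len(corpus))
    ],
)

q = client.embeddings.create(model="siati/bge-m3", input=["FINMA outsourcing rules"]).data[0].embedding
hits = qc.search(collection_name="docs", query_vector=q, limit=3)
print([h.payload["text"] for h in hits])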
Useful variations#
- Hybrid search (BM25 + dense): add rank_bm25 for better recall on queries with rare terms (medical codes, IBANs, ATC codes, etc.); see the first sketch after this list
- Re-ranking: after retrieving the top 20, run a cross-encoder to keep the top 5 (better quality, ~50 ms extra); see the second sketch below
- Hierarchical chunking: split by paragraph + index a summary of each section, navigate hierarchically for long documents
- Citation tracking: ask the model to cite the chunk IDs it used and make them clickable in the UI (a sketch follows below)
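A hybrid-search sketch for the first variation, assuming pip install rank_bm25 and reusing client, corpus, and vecs from the full script. The whitespace tokenization and the 50/50 weight are illustrative choices, not recommendations.

import numpy as np
from rank_bm25 import BM25Okapi

# Sparse index over the same chunks (tokenization kept deliberately simple)
bm25 = BM25Okapi([c.lower().split() for c in corpus])

def hybrid_retrieve(question: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    """Blend dense cosine similarity with normalized BM25 scores."""
    sparse = bm25.get_scores(question.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)          # scale to [0, 1]
    q = np.array(
        client.embeddings.create(model="siati/bge-m3", input=[question])
        .data[0].embedding
    )
    dense = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    scores = alpha * dense + (1 - alpha) * sparse
    return [corpus[i] for i in scores.argsort()[-k:][::-1]]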
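For the re-ranking variation, a sketch using a locally run cross-encoder via sentence-transformers. The model name is one common choice, not a requirement; anything that scores (query, passage) pairs works, and running it locally keeps the zero-egress property.

import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")   # runs locally

def rerank(question: str, candidates: list[str], k: int = 5) -> list[str]:
    """Score (question, chunk) pairs with the cross-encoder and keep the best k."""
    scores = reranker.predict([(question, c) for c in candidates])
    order = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in order]

# Usage: widen the first-stage retrieval, then narrow with the re-ranker
# top5 = rerank(question, retrieve(question, k=20), k=5)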
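And a citation-tracking sketch: the same flow as chat() in the full script, with numbered chunks and a prompt that asks the model to cite them. The [n] numbering is just an illustrative convention.

def chat_with_citations(question: str) -> str:
    """Like chat(), but chunks are numbered so the model can cite its sources."""
    chunks = retrieve(question, k=3)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="siati/llama-3.1-405b",
        messages=[
            {"role": "system", "content":
                "Answer using ONLY the context. After each claim, cite the chunk "
                "numbers you relied on, e.g. [1][3]. If the answer is not in the "
                "context, say so clearly."},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content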
LangChain integration#
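# pip install langchain langchain-openai langchain-community faiss-cpu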
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
emb = OpenAIEmbeddings(
    base_url="https://api.siati.ai/v1",
    api_key="siati_...",
    model="siati/bge-m3",
)
llm = ChatOpenAI(
    base_url="https://api.siati.ai/v1",
    api_key="siati_...",
    model="siati/llama-3.1-405b",
)
vectorstore = FAISS.from_texts(corpus, emb)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# ... use the retriever in a RetrievalQA chain
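One way to finish the wiring, as a sketch with the legacy RetrievalQA chain (LCEL-style chains work just as well; the question string is only an example):

from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa.invoke({"query": "Can a Swiss bank use AWS Bedrock for client data?"})["result"])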