I Built a RAG for my family’s cat. Here’s what the real work looks like.
A real use case, a strict grounding constraint, and what building it actually taught me.

Our family cat Berlioz was diagnosed with Chronic Kidney Disease. The vet left without much dietary guidance. My mother started sending me photos of ingredient labels to validate — phosphorus levels, inorganic phosphates, urinary acidifiers. Every week.
I have a biochemistry background. I read the literature, compiled a knowledge base, wrote a document. That document created a new problem: now she had questions. Lots of them. Regularly.
I could have told her to ask ChatGPT. But a general-purpose LLM will confidently invent nutritional thresholds for a sick cat. A hallucinated phosphorus value of 1.3 g/Mcal instead of 1.0 isn’t “close enough” — it’s potentially harmful. That’s a hard no.
I knew RAG. I'd read the papers, followed the benchmarks, and understood the pattern from my studies and ongoing technical reading. So I built Berlioz — a feline nutrition analyzer with a strictly grounded RAG chat, deployed on Vercel, at zero infrastructure cost.
What I didn’t expect was how much building it for real — on a domain where mistakes have consequences — would force me to think differently about every decision in the pipeline.
The problem: why a general LLM fails here
Before architecture, the why matters.
The naive solution is to paste the question into ChatGPT with some context. The problem is structural: a generative model fills gaps in its knowledge with plausible extrapolations. For feline nephrology — a niche domain with precise thresholds derived from specific studies — plausible is dangerous.
The two alternatives I considered seriously:
Fine-tuning. Retrain the model on veterinary nutrition data so the knowledge becomes intrinsic to its weights. Sounds right. Doesn’t work. Fine-tuning teaches a model to sound like a feline nutritionist — it learns style, not facts. It will interpolate. If the training data says “phosphorus ≤ 1.2 g/Mcal”, the fine-tuned model might output 1.15 or 1.3 depending on context. It generates, it doesn’t cite. Add the cost of retraining every time new research comes out, and fine-tuning is a non-starter for a domain that evolves.
RAG. Keep the knowledge external. At query time, retrieve the relevant chunks and inject them into the prompt. The LLM reasons on the retrieved data — it doesn’t generate from weights alone. Update the knowledge base? Re-index. No retraining. No frozen knowledge.
The formula: RAG = LLM as reasoning engine + vector database as external memory.
The theory is clean. The practice has edges.
The architecture
Two phases. Different lifecycles.
Indexing — runs once offline, then again whenever the knowledge base changes:
```
data/knowledge_base.md
  │ semantic chunking by H2/H3 headings
  ▼
~45 chunks (~500 tokens each, sectionPath in metadata)
  │ embed via OpenAI text-embedding-3-small (1536 dim)
  │ SHA256 cache → skip unchanged chunks
  ▼
Upstash Vector DB (upsert in batches of 5)
```
Retrieval — runs on every user message:
```
user question
  │ embed via same model (LRU in-memory cache)
  ▼
Upstash query → top-6 chunks (cosine similarity)
  │ filter score ≥ 0.7
  ▼
<knowledge_base> block injected into system prompt
  │
Groq Kimi K2 (temp 0.15) → answer grounded strictly on retrieved context
```
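The in-memory LRU cache in front of the embedding call can be sketched as a small class built on a `Map`, which preserves insertion order. This is a sketch, not the project's actual module; the class name and size are assumptions.

```javascript
// Tiny LRU cache: Map iteration order is insertion order, so the first key
// is always the least recently used entry.
class LRUCache {
  constructor(maxSize = 100) {
    this.maxSize = maxSize;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    else if (this.map.size >= this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict the oldest entry
    }
    this.map.set(key, value);
  }
}
```

Keying the cache on the exact question text means a repeated question skips the OpenAI embedding call entirely, which matters when the same few questions come back every week.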
The full stack: Vanilla JS frontend on Vercel, Node.js serverless functions for the API, Upstash Vector for the vector store, OpenAI for embeddings, Groq for inference.
One detail worth noting: two different Groq models serve different tasks. llama-3.3-70b-versatile handles structured JSON extraction from raw label text (/api/normalize) — a deterministic extraction task where output format matters more than reasoning. kimi-k2-instruct handles the RAG chat (/api/chat) — selected after comparative testing for strict instruction-following in French. Both run at low temperature (0.1 and 0.15 respectively) for near-deterministic outputs.
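That split can be captured as a tiny routing table. The model identifiers and temperatures come from the article; the shape of the table and the `pickModel` helper are my own sketch, not the project's code.

```javascript
// Per-task model routing: extraction and chat have different requirements,
// so they get different models and temperatures.
const MODELS = {
  normalize: { id: 'llama-3.3-70b-versatile', temperature: 0.1 },  // structured JSON extraction
  chat:      { id: 'kimi-k2-instruct',        temperature: 0.15 }, // strictly grounded RAG chat
};

function pickModel(task) {
  const config = MODELS[task];
  if (!config) throw new Error(`Unknown task: ${task}`);
  return config;
}
```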
Lesson 1: Chunking is an architectural decision, not a preprocessing step
Every RAG tutorial I’d read treated chunking as a detail — “split your text every 512 tokens, add some overlap, you’re done.”
That works for generic corpora. It breaks on structured domain knowledge.
The knowledge base for Berlioz is a structured literature review: each ## section covers a nutrient category (proteins, phosphorus, lipids, carbohydrates, fibers, additives), each ### subsection covers a specific molecule or threshold with its mechanism, evidence level, and practical limits. The corpus covers ~15 studies (2006–2024) and produces ~45 indexed chunks. Cutting across those boundaries at fixed token counts would separate a phosphorus threshold from its evidence level, or a clinical recommendation from its exception.
The decision: semantic chunking by heading. Each ## or ### section becomes one chunk. The section hierarchy becomes a sectionPath stored in metadata.
```javascript
function chunkMarkdown(content, source) {
  const lines = content.split('\n');
  const headingStack = [];
  const chunks = [];
  let currentChunk = [];

  // flush the current chunk before starting a new one
  const flush = () => {
    if (currentChunk.length > 0) {
      chunks.push({
        text: currentChunk.join('\n').trim(),
        source,
        sectionPath: headingStack.join(' > ') // buildSectionPath in the full version
      });
      currentChunk = [];
    }
  };

  for (const line of lines) {
    const h2 = line.match(/^## (.+)$/);
    const h3 = line.match(/^### (.+)$/);
    if (h2 || h3) {
      flush();
      if (h2) { headingStack.length = 0; headingStack.push(h2[1]); }
      else {
        if (headingStack.length > 1) headingStack.length = 1;
        headingStack.push(h3[1]);
      }
    }
    currentChunk.push(line);
  }
  flush();
  return chunks;
}
```

The sectionPath produces paths like "B. Minerals > 4. High total phosphorus (especially inorganic)". This ends up in the vector metadata and becomes the most useful debugging tool in the entire system — when a response looks wrong, you trace it directly to the chunk that fed it.
The chunking strategy is inseparable from the document structure and the domain. There is no universal right answer — and that’s the first thing you figure out when you stop following tutorials and start building something real.
Lesson 2: The similarity threshold is not a hyperparameter — it’s a policy decision
The theory says: retrieve the top-K chunks by cosine similarity. In practice, Upstash always returns exactly K results — even if all of them are semantically unrelated to the question.
You need a threshold. The question is what that threshold means for your use case.
```javascript
const relevantChunks = results
  .filter(r => r.score >= 0.7)
  .map(r => ({
    sectionPath: r.metadata?.sectionPath,
    text: r.data,
    score: r.score
  }));

if (relevantChunks.length === 0) {
  return ''; // triggers "I don't have this information" response
}
```
Setting it to 0.7 wasn’t a random choice. Too low (0.4–0.5) and you inject chunks that are topically adjacent but not actually relevant — the LLM gets confused context and produces worse answers. Too high (0.85+) and legitimate questions fail to retrieve anything.
But here’s what the papers don’t say clearly: 0.7 is a policy, not a truth. It encodes a decision about what kind of errors you’re willing to accept. For Berlioz, a false negative (telling my mother “I don’t have this information” when I actually do) is far less harmful than a false positive (injecting irrelevant context that leads to a wrong nutritional verdict). So the threshold is deliberately conservative.
The threshold isn’t a technical parameter to optimize — it’s a risk tolerance decision specific to your domain. For a production system with volume, you’d calibrate it on a labeled query set. For a family side project, you calibrate it empirically and document the choice.
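For a sense of what that calibration looks like, here is a sketch: given a small labeled set where each query carries its best retrieval score and a relevance label, count the two error types at each candidate threshold. The function name, field names, and data are illustrative, not from the project.

```javascript
// For each candidate threshold, count the two error types the article describes:
// false positive = irrelevant chunk injected into the prompt,
// false negative = relevant chunk rejected ("I don't have this information").
function calibrate(labeled, thresholds) {
  return thresholds.map(t => {
    const fp = labeled.filter(x => x.score >= t && !x.relevant).length;
    const fn = labeled.filter(x => x.score < t && x.relevant).length;
    return { threshold: t, falsePositives: fp, falseNegatives: fn };
  });
}
```

For Berlioz's risk profile, you would weight false positives far more heavily than false negatives when choosing the value, which is exactly why the final threshold sits on the conservative side.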
Lesson 3: The system prompt is the most critical safety layer — and the most underrated
Having the right chunks in context is necessary but not sufficient. Without strict constraints, an LLM will blend its internal knowledge with the retrieved context. For a general Q&A chatbot, that’s fine. For a medical domain chatbot, it’s the exact failure mode you built RAG to prevent.
Every tutorial I’d seen treated the system prompt as a formatting concern — “tell the model to be helpful and concise.” The Berlioz system prompt is a hard constraint:
```
⚠️ You do NOT have your own knowledge about feline nutrition.
⚠️ You MUST use ONLY the information provided in <knowledge_base>.
⚠️ If the information is NOT in <knowledge_base>, say:
   "I don't have this information in my knowledge base."

FORBIDDEN:
- Inventing nutritional values
- Inventing CKD/UTI thresholds
- Using "general knowledge"
```
Two things I learned building this:
First, a prohibition is stronger than a preference. “Prefer the knowledge base” creates probabilistic behavior. “Only use the knowledge base” creates a hard constraint. The difference is qualitative — one is a style instruction, the other is an architecture decision expressed in natural language.
Second, empty context must be handled explicitly. When retrieval returns nothing above the threshold, the <knowledge_base> block contains [NO DATA RETRIEVED] and the model is instructed to ask the user to rephrase. Without this, the model defaults to its internal knowledge and the entire RAG grounding breaks silently.
```javascript
function buildSystemPrompt(context) {
  if (!context || context.trim().length === 0) {
    return `${SYSTEM_BASE}
<knowledge_base>
[NO DATA RETRIEVED]
Ask the user to rephrase or provide the exact product composition.
</knowledge_base>`;
  }
  return `${SYSTEM_BASE}
<knowledge_base>
${context}
</knowledge_base>
REMINDER: Use ONLY the information above. Do not invent anything.`;
}
```

The system prompt isn’t a formatting concern. It’s a contract between you and the model about what it’s allowed to do. Write it like one.
Lesson 4: Observability is not optional
A RAG system is a black box by default. A question goes in, an answer comes out. When the answer is wrong, you have no idea where the failure happened — was it retrieval? Was it a bad chunk? Was the model drifting outside the context?
The fix is cheap: log the sectionPath and score of every chunk that enters the context. That single addition transforms debugging from guesswork into an audit trail.
```javascript
console.log('[RAG] ✅', relevantChunks.length, 'chunks retained');
relevantChunks.forEach(c =>
  console.log(`  [${c.score.toFixed(3)}] ${c.sectionPath}`)
);
```

When a response looks suspicious, you check the logs. Either the wrong chunks were retrieved (chunking or threshold problem), the right chunks were retrieved but the model ignored them (system prompt problem), or no chunks were retrieved (the question was phrased outside the knowledge base vocabulary). Each failure mode has a different fix. Without logging, they’re indistinguishable. You can’t evaluate what you can’t see — start by making retrieval visible.
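That triage can be sketched as a simple decision function. The field names are illustrative: they stand in for what you read off the retrieval logs and the answer itself.

```javascript
// Map what the retrieval logs show to the likely failure mode described above.
function triageFailure({ chunkCount, chunksWereRelevant, answerStayedGrounded }) {
  if (chunkCount === 0) return 'retrieval-miss';   // question outside KB vocabulary
  if (!chunksWereRelevant) return 'wrong-chunks';  // chunking or threshold problem
  if (!answerStayedGrounded) return 'model-drift'; // system prompt problem
  return 'ok';
}
```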
Lesson 5: The knowledge base quality is the ceiling
This is the one the papers actually get right, but it doesn’t sink in until you build something.
The RAG is only as good as what it retrieves. And what it retrieves is only as good as what was indexed. No amount of prompt engineering compensates for a knowledge base that’s incomplete, inconsistently structured, or poorly sourced.
For Berlioz, the knowledge base is a ~20KB structured literature review I wrote from scratch — 15+ studies, each section annotated with evidence levels (🟢 strong / 🟡 moderate / 🔴 weak). Writing the knowledge base took longer than building the entire pipeline. That’s not a bug. That’s the correct allocation of effort. The knowledge base is the product. The RAG pipeline is the delivery mechanism.
What the system looks like today
The complete request flow:
```
User: "Is this phosphorus level ok for a CKD cat?"
 │
 ├─ api/chat.js receives POST /api/chat
 │   ├─ extract last user message
 │   ├─ retrieve(message, k=6)
 │   │   ├─ embedQuery() → OpenAI API (or LRU cache hit)
 │   │   ├─ queryVector() → Upstash REST
 │   │   ├─ filter score ≥ 0.7
 │   │   └─ buildContext() → markdown ≤ 2000 tokens
 │   ├─ buildSystemPrompt(context)
 │   └─ POST Groq (kimi-k2, temp=0.15, max_tokens=1500)
 │
 └─ { text: "Based on the knowledge base, phosphorus above 1.0%MS..." }
```
Known limitations, documented honestly:
- Chunking quality is a ceiling. If two related pieces of information sit in separate chunks and retrieval returns only one, the answer is incomplete. Semantic chunking by heading reduces this risk; it doesn’t eliminate it.
- The threshold is empirical. A question phrased very differently from how the knowledge base is written may return “I don’t have this information” even when the answer exists.
- The LLM can still drift. Strict instructions make drift rare, not impossible. On highly ambiguous questions, the model may occasionally slip toward internal knowledge.
- Single-source knowledge base. Intentional for quality control. Limits coverage by design.
What I’d do differently
Add evaluation from day one. I had no formal ground-truth dataset. I knew the system worked because my mother stopped calling me about Berlioz’s meals — that’s not a metric. Tools like LangSmith let you measure groundedness, relevance, and correctness via LLM-as-judge on a labeled dataset. I’d build that before shipping next time.
👉 Complete RAG evaluation tutorial by LangChain
Design the knowledge base for retrieval, not for reading. The KB was written to be read by a human expert. Some sections mix mechanisms, thresholds, and evidence levels in a single block — which makes for a good document but creates retrieval ambiguity. Next time I’d separate those concerns at the writing stage.
Treat the similarity threshold as a monitored parameter. Right now it’s a hardcoded constant. In production, you’d want to track the distribution of scores per query, detect drift when the KB changes, and alert when too many queries return zero chunks.
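As a sketch, that monitoring could be as small as a rolling window over each query's top retrieval score. All names, window sizes, and alert rates here are illustrative assumptions, not the project's code.

```javascript
// Rolling-window monitor: record each query's best retrieval score and alert
// when too many recent queries would retrieve zero chunks at the threshold.
function makeRetrievalMonitor({ threshold = 0.7, windowSize = 50, maxZeroRate = 0.3 } = {}) {
  const topScores = [];
  return {
    record(topScore) {
      topScores.push(topScore);
      if (topScores.length > windowSize) topScores.shift(); // slide the window
    },
    zeroChunkRate() {
      if (topScores.length === 0) return 0;
      return topScores.filter(s => s < threshold).length / topScores.length;
    },
    shouldAlert() {
      return topScores.length === windowSize && this.zeroChunkRate() > maxZeroRate;
    }
  };
}
```

A spike in the zero-chunk rate right after re-indexing is exactly the kind of silent drift this would catch.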
Conclusion
The RAG behind Berlioz isn’t sophisticated by research standards — no reranking, no HyDE, no multi-hop reasoning. It’s a well-scoped RAG: a stable structured knowledge base, a strict grounding constraint, and tool choices that match real deployment constraints.
The gap between knowing the RAG pattern and shipping a RAG that you trust in a domain where mistakes matter is exactly where the real engineering happens. It’s not in the vector math. It’s in the chunking strategy, the threshold policy, the system prompt contract, and the observability you build around it.
I’m Camille Lebrun, a Data Consultant. I write about SQL optimization, data modeling, and analytics engineering.
Found this useful? Clap on Medium to help other engineers find it. What’s the first thing you’d harden in a small RAG you had to trust? Drop it in the comments.