Tags: rag, embeddings, chunking

How measuring my own wiki kept proving me wrong

Building a RAG over 250 pages of interview notes, and watching my assumptions die one measurement at a time.

It started with laziness

I wanted to automate my interview prep. The idea was to build a knowledge base from my notes and use it to simulate interviews, surfacing the gaps I needed to close rather than redoing the same manual review before every application.

So I built a wiki. The shape came from Karpathy's LLM Wiki gist. Rather than re-derive knowledge from raw sources on every query, you let the LLM incrementally build and maintain a persistent, cross-linked wiki. Raw sources stay immutable. The LLM owns the wiki layer. A schema file tells it how to behave.

I extended that in two directions. Every agent needed the same orientation, so the shared instructions went into a top-level AGENTS.md that any agent reads on arrival. And because one agent doing everything was a mess, I split the work into seven single-purpose agents, each writing only where its mandate allows. Of those seven, LIBRARIAN is the one that matters for this story. It restructures notes, merges stubs, expands thin sections, and maintains cross-links. Every other agent reads what it maintains.

The wiki is the single source of truth. Agents compound on each other through it, never through private state.

250 pages later, the tokens added up

At around 250 pages, "give the agent what it needs to know" had quietly become "give the agent a big slice of the wiki," and the token bill showed it. Time for retrieval: store the wiki as vectors, pull back only the few chunks each query needs.

So I built a RAG, and immediately hit the question that turned out to be the whole story: what granularity should the chunks be?

First wrong turn: thinking the files were already my chunks

My notes have headers and sections, so I figured I had semantic chunks for free. One file, one chunk. Done.

Then I measured. My files average eight sections each, so one vector per file smears eight topics together, and a query about one of them drags back the whole file. Splitting on headers fixed that. Section-level it is.
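Splitting on headers can be sketched in a few lines. This is a minimal illustration, assuming the notes use ATX-style Markdown headers (`#` to `######`); the `split_sections` name and the choice to skip headers inside fenced code blocks are assumptions for the example, not a description of the actual script.

```python
import re

# Matches ATX-style Markdown headers such as "# Title" or "### Sub-topic".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(text):
    """Split one note into (title, body) pairs, one pair per header.

    Text before the first header is kept under an empty title so nothing
    is silently dropped. Lines inside fenced code blocks are never
    treated as headers.
    """
    sections = []
    title, lines = "", []
    in_fence = False

    def flush():
        # Emit the section collected so far, unless it is completely empty.
        if title or any(ln.strip() for ln in lines):
            sections.append((title, "\n".join(lines).strip()))

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            title, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each returned pair becomes one candidate chunk, so a file with eight headers yields eight vectors instead of one.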

The same measurement caught a different mistake before I made it. I had planned to find my biggest file and buy an embedder whose context window could swallow it. My biggest files were log.md at 14,501 tokens, backlog.md at 11,417, and index.md at 7,976. None of them are knowledge. They are plumbing. I nearly sized my whole embedder choice around a file that should not be in the corpus at all.

ROOT: ~/wiki
included files: 251   excluded: 0
total tokens (content only): 310633

FILE-level:    n=251  max=14501  p95=2484  median=972  mean=1237
SECTION-level: n=1969  max=4685   p95=459   median=104  mean=157
  oversized  (> 512): 72 (3%)   -> split candidates
  undersized (< 64):  483 (24%) -> merge candidates
  sections/file: median=8 max=52

biggest sections:
  4685  wiki/backlog.md
  4479  wiki/backlog.md
  2933  wiki/decisions/ledger.md
  2146  wiki/debriefs/2026-05-27-[company]-[person].md
  1902  wiki/index.md

Lesson: scope the corpus before you measure it, not after.
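The summary rows in the report above (n, max, p95, median, mean) are plain descriptive statistics over per-file or per-section token counts. Here is a minimal sketch of how such rows could be computed, assuming the counts already exist as a list of integers from whatever tokenizer matches the embedder. The nearest-rank percentile and the `summarize` and `format_row` names are assumptions for illustration; the original script may compute p95 differently.

```python
import statistics


def summarize(token_counts):
    """Return n, max, p95, median and mean for a list of token counts."""
    if not token_counts:
        raise ValueError("need at least one count")
    ordered = sorted(token_counts)
    n = len(ordered)
    # Nearest-rank p95 in integer arithmetic: rank = ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "n": n,
        "max": ordered[-1],
        "p95": ordered[rank - 1],
        "median": round(statistics.median(ordered)),
        "mean": round(statistics.fmean(ordered)),
    }


def format_row(label, stats):
    """Render one report row, e.g. 'FILE-level: n=5  max=1000  ...'."""
    return f"{label}: " + "  ".join(f"{k}={v}" for k, v in stats.items())
```

Running the same function over file-level and section-level counts gives the two rows side by side, which is what makes the gap between a 14,501-token file and a 104-token median section visible.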

The corpus, not the chunk size

So I dropped the plumbing and measured content only. The corpus itself drives every later decision.

Even without plumbing, my biggest sections were not the material I cared about. They were interview transcripts and debriefs. My wiki keeps very different kinds of writing side by side: interview prep, debriefs, AWS notes. All useful to me, all shaped differently, and one chunking rule across the lot is a compromise everywhere.

The tail surprised me too. I had braced for chunks too big for the embedder. The real problem was the reverse. 27% of my sections are under 64 tokens, too thin to retrieve well on their own. (The 64/512 boundaries are rules of thumb for the class of embedders I'm evaluating, small models in the text-embedding-3-small range, where the sweet spot sits between roughly 50 and 512 tokens. I haven't locked a final model yet; the point of measuring first is to let the data inform that choice rather than the reverse.)
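The 64/512 bucketing is simple enough to sketch. This assumes section token counts are already available as integers; the `bucket` name is hypothetical. The percentages in this article's reports look truncated rather than rounded (48 of 1686 prints as 2%), so the sketch uses integer division, which is an inference from the output rather than a known detail of the script.

```python
UNDERSIZED = 64   # sections below this are merge candidates
OVERSIZED = 512   # sections above this are split candidates


def bucket(section_tokens, low=UNDERSIZED, high=OVERSIZED):
    """Count sections outside the [low, high] band, with their share in %."""
    total = len(section_tokens)

    def pct(count):
        # Truncating integer percentage; 0 for an empty corpus.
        return 100 * count // total if total else 0

    under = sum(1 for t in section_tokens if t < low)
    over = sum(1 for t in section_tokens if t > high)
    return {"undersized": (under, pct(under)), "oversized": (over, pct(over))}
```

Keeping the thresholds as parameters makes it cheap to rerun the count if a different embedder shifts the band.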

ROOT: ~/wiki
included files: 227   excluded: 24 (index.md x20, backlog.md, ledger.md,
                                     2026-05-27-[company]-[person].md, log.md)
total tokens (content only): 229689

FILE-level:    n=227  max=4647  p95=2028  median=889  mean=1011
SECTION-level: n=1686 max=1369  p95=417   median=96   mean=136
  oversized  (> 512): 48 (2%)   -> split candidates
  undersized (< 64):  458 (27%) -> merge candidates
  sections/file: median=7 max=15

biggest sections:
  1369  wiki/interviews/2026-05-26-[company].md
  1156  wiki/interviews/2026-05-26-[company].md
  1155  wiki/interviews/2026-05-28-dry-run.md
  1006  wiki/sources/[company]/motivation_raw.md
   988  wiki/interviews/2026-05-26-[company]-[role].md
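Scoping like this amounts to a filename filter applied before any counting. A minimal sketch, assuming exclusion by file name only; the `EXCLUDE_NAMES` list is modelled on the plumbing files named in the report above and the `scope` function is hypothetical. A single dated debrief, like the one excluded above, would need its full name added to the list.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical exclusion list: operational files, not knowledge.
EXCLUDE_NAMES = ["index.md", "backlog.md", "ledger.md", "log.md"]


def scope(paths, exclude=EXCLUDE_NAMES):
    """Partition paths into (included, excluded) by file-name pattern."""
    included, excluded = [], []
    for path in paths:
        name = PurePosixPath(path).name
        # Match on the file name only, so "index.md" is caught in any folder.
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            excluded.append(path)
        else:
            included.append(path)
    return included, excluded
```

Returning the excluded list as well as the included one keeps the decision auditable, in the same way the report prints both counts.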

What this taught me

Some questions you look up, some you measure. "Does this chunk fit the embedder?" is a lookup, the same answer for everyone. "What granularity, and which embedder, work on my notes?" depends entirely on my data. My first plan treated a measure-question as a lookup.

Scope beats mechanics. Twice, "what is my biggest chunk?" resolved to a file that should not be indexed at all. What goes in the corpus moved my results far more than how I sliced it, and that is a judgment about what the system is for, not a number.

The leverage is in the source. Retrieval chunks are derived from how I write notes, and LIBRARIAN edits that same structure directly. Get the notes right, small atomic sections with clean headers, and both jobs get easier at once. The clever chunker I thought I needed is mostly downstream of how I write a single note.

Scoping down to what I actually retrieve

Mid-puzzle, I found a well-kept external repo, artreimus/notes-aws-machine-learning, that already maps the full MLA-C01 syllabus and is actively maintained against current AWS docs. It has no licence, which rules out folding it into my wiki. That turned out to clarify things rather than block them.

Narrowing my index to my own gaps means most of what I might have indexed was never mine to index. I measured the concepts-and-entities slice with everything else excluded: 95 files in, 156 out.

ROOT: ~/wiki
included files: 95   excluded: 156
total tokens (content only): 92769

FILE-level:    n=95   max=3205  p95=2028  median=879  mean=976
SECTION-level: n=801  max=832   p95=298   median=91   mean=115
  oversized  (> 512): 7  (0%)  -> split candidates
  undersized (< 64):  190 (23%) -> merge candidates
  sections/file: median=8  max=15

biggest sections:
   832  wiki/concepts/AI Coding Agents in Enterprise.md
   631  wiki/concepts/Tokenization Strategies.md
   619  wiki/concepts/LLMOps.md
   593  wiki/concepts/Vector Databases and RAG Architecture.md
   590  wiki/concepts/Tokenization Strategies.md

The scoping paid off cleanly:

  1. Tokens fell from 230k to 93k (310k in the unscoped wiki), not from harder chunking but because 156 of the 251 files never belonged here. More were excluded than included, most of them operational artefacts.
  2. The oversized problem nearly vanished: 48 sections down to 7, the largest a comfortable 832 tokens. The files driving my chunk-size anxiety were transcripts and application notes, not the concept entries I actually retrieve.
  3. The undersized tail barely moved, from 27% to 23%. That one doesn't bow to scoping. It lives in the concept notes, short stubs that need merging. Mechanical, bounded, and mine to do.

What's next

The merge step. 190 sections under 64 tokens need a look. Some merge cleanly, some are stubs to expand, a few are fine as they are.
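One possible shape for that merge pass, as a rough sketch rather than a settled design: walk the sections of a single file in order and fold a section into the previous group whenever either side is under 64 tokens and the combined size stays within 512. The `merge_small` name and the greedy rule are assumptions, and summing token counts only approximates the size of the merged text.

```python
def merge_small(sections, low=64, high=512):
    """Greedily fold undersized sections into the preceding group.

    `sections` is an ordered list of (title, token_count) pairs from one
    file. Returns a list of (titles, token_count) groups.
    """
    groups = []
    for title, tokens in sections:
        if groups:
            prev_titles, prev_tokens = groups[-1]
            # Merge only when one side is a stub and the result still fits.
            small = tokens < low or prev_tokens < low
            if small and prev_tokens + tokens <= high:
                groups[-1] = (prev_titles + [title], prev_tokens + tokens)
                continue
        groups.append(([title], tokens))
    return groups


def needs_review(groups, low=64):
    """Groups still under the threshold: stubs to expand or leave alone."""
    return [g for g in groups if g[1] < low]
```

Whatever `needs_review` returns is the part a greedy rule cannot settle: those are the stubs to expand by hand, or to accept as fine the way they are.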

After that, the corpus is finally homogeneous. All 95 files are concept and entity entries written to the same schema. The question is whether a clean single-domain index over these entries retrieves well on its own, without routing or clustering. My prediction: it will. If the next round of measurements proves me wrong, that will be interesting too.