How measuring my own wiki kept proving me wrong
Building a RAG over 250 pages of interview notes, and watching my assumptions die one measurement at a time.
It started with laziness
I wanted to automate my interview prep. The idea was to build a knowledge base from my notes and use it to simulate interviews, surfacing the gaps I needed to close rather than redoing the same manual review before every application.
So I built a wiki. The shape came from Karpathy's LLM Wiki gist. Rather than re-derive knowledge from raw sources on every query, you let the LLM incrementally build and maintain a persistent, cross-linked wiki. Raw sources stay immutable. The LLM owns the wiki layer. A schema file tells it how to behave.
I extended that in two directions. Every agent needed the same orientation, so the shared instructions went into a top-level AGENTS.md that any agent reads on arrival. And because one agent doing everything was a mess, I split the work into seven single-purpose agents, each writing only where its mandate allows. Of those seven, LIBRARIAN is the one that matters for this story. It restructures notes, merges stubs, expands thin sections, and maintains cross-links. Every other agent reads what it maintains.
The wiki is the single source of truth. Agents compound on each other through it, never through private state.
250 pages later, the tokens added up
At around 250 pages, "give the agent what it needs to know" had quietly become "give the agent a big slice of the wiki," and the token bill showed it. Time for retrieval: store the wiki as vectors, pull back only the few chunks each query needs.
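A minimal sketch of that retrieval step, assuming the chunks are already embedded and held in memory. The toy three-dimensional vectors and the `#monitoring`-style chunk ids are hypothetical stand-ins; a real index would hold embedder output and likely live in a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `index` maps a chunk id to its embedding vector.
    """
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (hypothetical chunk ids).
index = {
    "concepts/LLMOps.md#monitoring": [0.9, 0.1, 0.0],
    "concepts/Tokenization Strategies.md#bpe": [0.1, 0.9, 0.1],
    "concepts/LLMOps.md#deployment": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

Only the `k` best-scoring chunks reach the agent's context, which is where the token savings would come from.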
So I built a RAG, and immediately hit the question that turned out to be the whole story: what granularity should the chunks be?
First wrong turn: thinking the files were already my chunks
My notes have headers and sections, so I figured I had semantic chunks for free. One file, one chunk. Done.
Then I measured. My files average eight sections each, so one vector per file smears eight topics together, and a query about one of them drags back the whole file. Splitting on headers fixed that. Section-level it is.
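Splitting on headers can be sketched roughly like this, assuming the notes are Markdown with ATX-style `#` headers. The `split_sections` helper is hypothetical, and the fence handling is deliberately simple: it toggles on any fence line rather than matching opening and closing fences.

```python
import re

# ATX-style Markdown header: 1-6 '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# Opening or closing line of a fenced code block.
FENCE = re.compile(r"^\s*(```|~~~)")

def split_sections(text):
    """Split Markdown text into (title, body) pairs, one per header.

    Non-blank text before the first header is returned under the title "".
    Lines inside fenced code blocks are never treated as headers.
    """
    sections = []
    title, body = "", []
    in_fence = False
    for line in text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        match = None if in_fence else HEADER.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(ln.strip() for ln in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2), []
        else:
            body.append(line)
    if title or any(ln.strip() for ln in body):
        sections.append((title, "\n".join(body).strip()))
    return sections
```

Each returned pair becomes one candidate chunk, so a file with eight headers yields eight vectors instead of one.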
The same measurement caught a different mistake before I made it. I had planned to find my biggest file and buy an embedder whose context window could swallow it. My biggest files were log.md at 14,501 tokens, backlog.md at 11,417, and index.md at 7,976. None of them are knowledge. They are plumbing. I nearly sized my whole embedder choice around a file that should not be in the corpus at all.
ROOT: ~/wiki
included files: 251 excluded: 0
total tokens (content only): 310633
FILE-level: n=251 max=14501 p95=2484 median=972 mean=1237
SECTION-level: n=1969 max=4685 p95=459 median=104 mean=157
oversized (> 512): 72 (3%) -> split candidates
undersized (< 64): 483 (24%) -> merge candidates
sections/file: median=8 max=52
biggest sections:
4685 wiki/backlog.md
4479 wiki/backlog.md
2933 wiki/decisions/ledger.md
2146 wiki/debriefs/2026-05-27-[company]-[person].md
1902 wiki/index.md
Lesson: scope the corpus before you measure it, not after.
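One way to make that scoping explicit is an exclusion list applied before any counting. This is a minimal sketch, assuming exclusion by file name with shell-style patterns; the `EXCLUDE` list and both helpers are hypothetical, and only the file names are taken from the output above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical exclusion list: operational files, not knowledge.
EXCLUDE = ["index.md", "log.md", "backlog.md", "ledger.md"]

def in_corpus(path, exclude=EXCLUDE):
    """True if the file's name matches none of the exclusion patterns."""
    name = PurePosixPath(path).name
    return not any(fnmatchcase(name, pattern) for pattern in exclude)

def scope(paths, exclude=EXCLUDE):
    """Partition paths into (included, excluded) lists, preserving order."""
    included = [p for p in paths if in_corpus(p, exclude)]
    excluded = [p for p in paths if not in_corpus(p, exclude)]
    return included, excluded
```

Reporting both lists, as the summaries here do with their included and excluded counts, keeps the scoping decision visible next to the numbers it shapes.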
The corpus, not the chunk size
So I dropped the plumbing and measured content only. The corpus itself drives every later decision.
Even without plumbing, my biggest sections were not the material I cared about. They were interview transcripts and debriefs. My wiki keeps very different kinds of writing side by side: interview prep, debriefs, AWS notes. All useful to me, all shaped differently, and one chunking rule across the lot is a compromise everywhere.
The tail surprised me too. I had braced for chunks too big for the embedder. The real problem was the reverse. 27% of my sections are under 64 tokens, too thin to retrieve well on their own. (The 64/512 boundaries are rules of thumb for the class of embedders I'm evaluating, small models in the text-embedding-3-small range, where the sweet spot sits between roughly 50 and 512 tokens. I haven't locked a final model yet; the point of measuring first is to let the data inform that choice rather than the reverse.)
ROOT: ~/wiki
included files: 227 excluded: 24 (index.md x20, backlog.md, ledger.md,
2026-05-27-[company]-[person].md, log.md)
total tokens (content only): 229689
FILE-level: n=227 max=4647 p95=2028 median=889 mean=1011
SECTION-level: n=1686 max=1369 p95=417 median=96 mean=136
oversized (> 512): 48 (2%) -> split candidates
undersized (< 64): 458 (27%) -> merge candidates
sections/file: median=7 max=15
biggest sections:
1369 wiki/interviews/2026-05-26-[company].md
1156 wiki/interviews/2026-05-26-[company].md
1155 wiki/interviews/2026-05-28-dry-run.md
1006 wiki/sources/[company]/motivation_raw.md
988 wiki/interviews/2026-05-26-[company]-[role].md
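The summary lines in these reports can be approximated with a few lines of standard-library Python. This is a sketch, not the script that produced the numbers above: `count_tokens` is a whitespace stand-in for a real tokenizer, the nearest-rank percentile is an assumption, and the integer division is a guess based on the reported figures (48 of 1686 is shown as 2%, which suggests truncation rather than rounding).

```python
import math
from statistics import median

def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words.

    An assumption for illustration; a real run would use the tokenizer
    of the embedder under evaluation.
    """
    return len(text.split())

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(token_counts, low=64, high=512):
    """Summary statistics plus the share of sections outside [low, high]."""
    n = len(token_counts)
    oversized = sum(1 for t in token_counts if t > high)
    undersized = sum(1 for t in token_counts if t < low)
    return {
        "n": n,
        "max": max(token_counts),
        "p95": percentile(token_counts, 95),
        "median": median(token_counts),
        "mean": sum(token_counts) // n,            # truncated, not rounded
        "oversized": oversized,
        "oversized_pct": 100 * oversized // n,     # truncated, not rounded
        "undersized": undersized,
        "undersized_pct": 100 * undersized // n,   # truncated, not rounded
    }
```

Calling `summarize` once on per-file counts and once on per-section counts gives the two levels the reports compare. The `low` and `high` defaults mirror the 64/512 rules of thumb and would change with the embedder.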
What this taught me
Some questions you look up, some you measure. "Does this chunk fit the embedder?" is a lookup, the same answer for everyone. "What granularity, and which embedder, work on my notes?" depends entirely on my data. My first plan treated a measure-question as a lookup.
Scope beats mechanics. Twice, "what is my biggest chunk?" resolved to a file that should not be indexed at all. What goes in the corpus moved my results far more than how I sliced it, and that is a judgment about what the system is for, not a number.
The leverage is in the source. Retrieval chunks are derived from how I write notes, and LIBRARIAN edits that same structure directly. Get the notes right, small atomic sections with clean headers, and both jobs get easier at once. The clever chunker I thought I needed is mostly downstream of how I write a single note.
Scoping down to what I actually retrieve
Mid-puzzle, I found a well-kept external repo, artreimus/notes-aws-machine-learning, that already maps the full MLA-C01 syllabus and is actively maintained against current AWS docs. It has no licence, which rules out folding it into my wiki. That turned out to clarify things rather than block them.
Since that repo already covers the syllabus, my wiki only has to hold my own gaps. Narrowing to gaps means most of what I might have indexed was never mine to index. So I measured the concepts-and-entities slice with everything else excluded: 95 files in, 156 out.
ROOT: ~/wiki
included files: 95 excluded: 156
total tokens (content only): 92769
FILE-level: n=95 max=3205 p95=2028 median=879 mean=976
SECTION-level: n=801 max=832 p95=298 median=91 mean=115
oversized (> 512): 7 (0%) -> split candidates
undersized (< 64): 190 (23%) -> merge candidates
sections/file: median=8 max=15
biggest sections:
832 wiki/concepts/AI Coding Agents in Enterprise.md
631 wiki/concepts/Tokenization Strategies.md
619 wiki/concepts/LLMOps.md
593 wiki/concepts/Vector Databases and RAG Architecture.md
590 wiki/concepts/Tokenization Strategies.md
The scoping paid off cleanly:
- Tokens fell from 229k to 93k, not from harder chunking but because files that never belonged here are gone. Of 251 files, 156 are now excluded against 95 included, most of them operational artefacts.
- The oversized problem nearly vanished: 48 sections down to 7, the largest a comfortable 832 tokens. The files driving my chunk-size anxiety were transcripts and application notes, not the concept entries I actually retrieve.
- The undersized tail barely moved, from 27% to 23%. That one doesn't bow to scoping. It lives in the concept notes, short stubs that need merging. Mechanical, bounded, and mine to do.
What's next
The merge step. 190 sections under 64 tokens need a look. Some merge cleanly, some are stubs to expand, a few are fine as they are.
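A first pass at listing merge candidates could look like the sketch below. It is one possible heuristic, not a decided approach: the hypothetical `merge_undersized` helper only looks at adjacency within a file and at token counts, so it cannot tell a stub that should be expanded from one that should be merged.

```python
def merge_undersized(sections, low=64, high=512):
    """Greedily fold undersized sections into the chunk before them.

    `sections` is a list of (title, token_count) pairs from one file, in
    document order. A section joins the previous chunk when either of the
    two is under `low` tokens and the combined size stays within `high`.
    Returns (titles, token_count) pairs, where `titles` lists the original
    sections that ended up in each chunk.
    """
    merged = []
    for title, tokens in sections:
        if merged:
            titles, total = merged[-1]
            needs_merge = tokens < low or total < low
            if needs_merge and total + tokens <= high:
                merged[-1] = (titles + [title], total + tokens)
                continue
        merged.append(([title], tokens))
    return merged
```

The output is a proposal to review by hand: any chunk whose `titles` list has more than one entry is a candidate merge, and any single-title chunk still under `low` is a stub to expand or leave alone.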
After that, the corpus is finally homogeneous. All 95 files are concept entries written to the same schema. The question is whether a clean single-domain index over these entries retrieves well on its own, without routing or clustering. My prediction: it will. If the next round of measurements proves me wrong, that will be interesting too.