9 June 2026

rag, embeddings, chunking

Merging my wiki, then building the baseline

A 25% drop in section count, a measurement bug, a retrieval pipeline that didn't need the clever parts, and recall@5 = 1.000 on the first try.

This picks up where the previous post ended. I had 95 concept-and-entity files, 801 sections, and 190 of them under the 64-token floor. The plan was to merge the short ones.

The corpus moved while I was working

By the time I ran the merge, the wiki had grown. 95 files had become 109, and the section count had climbed from 801 to 954, with the undersized tail still sizeable at 21% (208 sections). All the numbers below are measured against that updated baseline.

FILE-level:    n=109  max=4195  p95=2464  median=966  mean=1093
SECTION-level: n=954  max=832  p95=328  median=98  mean=124
  oversized  (> 512): 12 (1%)  -> split these
  undersized (< 64): 208 (21%)  -> merge candidates
  sections/file: median=9  max=15

A bug in the measurement script

Before I could trust any of the counts, I needed to fix the script. The header regex was matching # anywhere in the file, including inside fenced code blocks. A bash comment like # Start server or a YAML snippet was being read as a heading and spun into a phantom section of five to eight tokens. Frontmatter had a related problem: the --- ... --- block had no # heading of its own, so it became a preamble chunk of 50 to 63 tokens, pure metadata that nothing would ever retrieve.

The original per-file loop:

file_tokens, sections, sections_per_file = [], [], []
for f in files:
    text = f.read_text(encoding="utf-8", errors="ignore")
    file_tokens.append(ntok(text))
    idxs = [m.start() for m in header_re.finditer(text)]
    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]
    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]
    secs = [s for s in secs if s.strip()]
    sections_per_file.append(len(secs))
    sections += [(ntok(s), f) for s in secs]

The fix adds two helper regexes and separates the string that gets scanned for headings from the string that gets sliced into sections:

# Replace fenced code blocks with spaces of the same length so '#' inside
# bash/yaml/etc. snippets are not mistaken for headings, while character
# positions and token counts stay accurate.
fence_re = re.compile(r'```.*?```', re.S)
# Strip YAML frontmatter so it doesn't become a preamble chunk.
frontmatter_re = re.compile(r'\A---\n.*?\n---\n', re.S)

def _blank_code(text):
    return fence_re.sub(lambda m: ' ' * len(m.group(0)), text)

# The per-file loop, with the same three accumulators as before.
# _heading() is a small helper that is not shown in this excerpt.
for f in files:
    raw = f.read_text(encoding="utf-8", errors="ignore")
    file_tokens.append(ntok(raw))   # file totals still include frontmatter
    text = frontmatter_re.sub('', raw)
    text_scan = _blank_code(text)   # code blocks blanked; positions preserved
    idxs = [m.start() for m in header_re.finditer(text_scan)]
    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]
    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]
    secs = [s for s in secs if s.strip()]
    sections_per_file.append(len(secs))
    sections += [(ntok(s), f, _heading(s)) for s in secs]

Three things happen here. Frontmatter is stripped from text before chunking, but file_tokens is still counted from raw, so file-level totals stay honest. Code blocks are blanked to spaces of equal length rather than deleted: the header regex runs on text_scan, but each section's content is sliced from the original text, so a # inside a code block can't match as a heading while the section's token count still includes the code it contains. And the heading offsets found in text_scan are used to slice text (bounds ends at len(text), not len(text_scan)), which is safe only because _blank_code preserves character count.
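A minimal sketch of the blanking trick on a toy page. The post never shows header_re, so the pattern below is an assumption, and the sample text is made up for illustration.

```python
import re

# Assumed heading pattern; the real header_re is not shown in the post.
header_re = re.compile(r'^#{1,6} ', re.M)
fence_re = re.compile(r'```.*?```', re.S)

def _blank_code(text):
    # Same-length replacement keeps every character offset valid.
    return fence_re.sub(lambda m: ' ' * len(m.group(0)), text)

# Hypothetical page: one real H1, one real H2, one bash comment in a fence.
sample = (
    "# Title\n"
    "Intro text.\n"
    "```bash\n"
    "# Start server\n"
    "./run.sh\n"
    "```\n"
    "## Usage\n"
    "Body.\n"
)

naive = [m.start() for m in header_re.finditer(sample)]
scanned = _blank_code(sample)
fixed = [m.start() for m in header_re.finditer(scanned)]

print(len(naive), len(fixed))        # 3 2
print(len(scanned) == len(sample))   # True
print(sample[fixed[1]:fixed[1] + 8]) # ## Usage
```

The naive scan counts the bash comment as a third heading. The blanked scan drops it, and because the lengths match, the surviving offsets can be used directly on the original string.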

The net effect was about 15 phantom sections disappearing and all frontmatter preamble chunks ceasing to exist.

The manual merge work

With accurate counts in hand, I went through the undersized sections. Three patterns drove most of the reduction.

Cross-reference lists. Around a hundred files carried a ## Cross-references or ## Related pages heading over nothing but a list of wiki links. Structurally these are sections; semantically they are navigation. Collapsing them to an inline See also: line dissolved 102 sections without losing a single link.

Design-pattern facets. Several pages had given every facet its own heading: When to use, When NOT to use, framework equivalents, and cross-references, five to seven thin sections on one page that together make one solid entry. Merging the use-and-avoid headings and folding the equivalents back inline pulled roughly 14 sections into their neighbours.

Bare titles. Many pages went straight from # Title into ## Core idea with nothing between, leaving the title section as pure overhead. A sentence or two of lead-in pushed those above the floor and gave each page an opening line.

Results

Metric	Before	After
Total sections	954	713 (-25%)
Undersized (< 64 tok)	208 (21%)	111 (15%)
Median section size (tokens)	98	126
Sections/file (median)	9	6

A quarter of the sections were structure, not knowledge.

What's left, and why I'm not fixing it in markdown

111 sections are still under the floor. They split into two kinds.

About 40 are entity index cards: short reference pages where every section is naturally thin because the whole page is thin, 30 to 50 lines of fact. Padding them dilutes the signal. They are small by design.

The other 70 or so are concept titles that landed just under the line, 38 to 63 tokens even after a lead-in. Forcing more text means restating the first paragraph, which helps nothing.

The clean fix lives in the chunker. The rule is simple: if a section is under the threshold and another section follows it in the same file, concatenate the two before embedding. That folds thin concept titles into the body below them and treats a small reference card as one chunk instead of five thin ones. It is a small change in one place and it decouples how I write from how the text embeds, which is where that decision belongs.
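A rough sketch of that rule, assuming sections arrive as an ordered list of strings from one file and that ntok is any token-counting callable. The function name and the whitespace tokenizer in the usage lines are stand-ins, not the chunker's real code.

```python
def merge_thin_sections(sections, ntok, floor=64):
    """Fold each under-floor section into the section that follows it.

    `sections` is the ordered list of section strings from ONE file.
    `ntok` is any callable that returns a token count for a string.
    """
    merged = []
    carry = ""  # thin text waiting to be prepended to the next section
    for i, sec in enumerate(sections):
        current = carry + sec
        carry = ""
        is_last = i == len(sections) - 1
        if ntok(current) < floor and not is_last:
            # Still thin and something follows: push it forward.
            carry = current + "\n\n"
        else:
            merged.append(current)
    return merged


# Whitespace word count as a crude stand-in for a real tokenizer.
ntok = lambda s: len(s.split())

page = ["# Title", " ".join(["word"] * 70), "tiny tail"]
card = [" ".join(["fact"] * 10) for _ in range(5)]

print(len(merge_thin_sections(page, ntok)))  # 2
print(len(merge_thin_sections(card, ntok)))  # 1
```

The thin title is absorbed by the body below it, and the five-section card collapses into a single chunk. A thin section at the very end of a file has nothing to fold into and stays as it is.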

The alternative is to accept that the 64-token floor was always a rule of thumb. At a median of 126 tokens the corpus is healthy, and a 48-token section is a perfectly good retrieval unit. Drop the floor and much of the residual stops being a problem.

Either way the lesson holds. A quarter of my section count was structure rather than knowledge, and the cleanest fixes were about what counts as a chunk, not how cleverly I cut one.

Building the baseline

Past the merge step, the corpus is finally homogeneous. All the surviving files are concept entries written to the same schema. The thesis: a clean single-domain index over these entries should retrieve well on its own, without routing or clustering.

The plan: build the dumbest possible baseline, measure recall, and only add complexity if the numbers demand it.

The frozen corpus

The wiki kept growing during the merge work, so I froze a snapshot as corpus-v1 — a JSON manifest with SHA-256 hashes of every source file, treated as immutable for the experiment. Everything downstream points at this fixed snapshot.
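A sketch of what such a manifest could look like, assuming a flat list of path and hash entries. The field names and the verify step are illustrative guesses, not the actual layout of corpus-v1.json.

```python
import hashlib
import json
from pathlib import Path

def build_manifest(root, tag):
    """Hash every markdown file under `root` and return a manifest dict."""
    root = Path(root)
    files = []
    for path in sorted(root.rglob("*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        files.append({"path": path.relative_to(root).as_posix(), "sha256": digest})
    return {"tag": tag, "file_count": len(files), "files": files}

def write_manifest(manifest, out_path):
    """Persist the manifest as pretty-printed JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")

def verify_manifest(root, manifest):
    """Return the paths that are missing or whose hash no longer matches."""
    root = Path(root)
    changed = []
    for entry in manifest["files"]:
        path = root / entry["path"]
        if not path.exists():
            changed.append(entry["path"])
        elif hashlib.sha256(path.read_bytes()).hexdigest() != entry["sha256"]:
            changed.append(entry["path"])
    return changed
```

Running verify_manifest before each downstream step is one cheap way to notice that the wiki has drifted away from the snapshot the numbers were measured on.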

Metric	Value
Source files	113
Total chunks	673
Median tokens per chunk	160
p95 tokens	383
Max tokens	568
Oversized (>512 tokens)	6 (0.9%)
Undersized (<64 tokens)	6 (0.9%)

The undersized tail that drove the entire merge effort is gone. Six chunks on each side of the bounds is noise.

Chunking strategy

The chunker implements the merge-small-siblings rule from the previous section. It works in three passes:

1. Split on ## (H2) boundaries, so each section becomes a candidate chunk.
2. Merge consecutive small siblings (<64 tokens each) until the merged chunk reaches a 300-token target.
3. Split oversized sections (>512 tokens) by paragraph.
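The three passes can be sketched as below. This is a literal reading of the list rather than the real ingest/chunker.py: the function names are hypothetical, ntok is any token-counting callable, and the heading regex assumes code fences have already been dealt with.

```python
import re

H2_RE = re.compile(r'^## ', re.M)  # assumption: no '## ' lines inside code fences

def split_h2(text):
    """Pass 1: cut at every H2; text before the first H2 is its own candidate."""
    starts = [m.start() for m in H2_RE.finditer(text)]
    bounds = ([0] if not starts or starts[0] > 0 else []) + starts + [len(text)]
    parts = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [p for p in parts if p]

def merge_small(chunks, ntok, floor=64, target=300):
    """Pass 2: join runs of consecutive under-floor chunks, capped at target."""
    out, run = [], []

    def flush():
        if run:
            out.append("\n\n".join(run))
            run.clear()

    for chunk in chunks:
        if ntok(chunk) >= floor:
            flush()              # a full-size chunk ends the current run
            out.append(chunk)
            continue
        if run and ntok("\n\n".join(run + [chunk])) > target:
            flush()              # adding this one would overshoot the target
        run.append(chunk)
    flush()
    return out

def split_oversized(chunk, ntok, limit=512):
    """Pass 3: regroup paragraphs so each piece stays within the limit."""
    if ntok(chunk) <= limit:
        return [chunk]
    pieces, current = [], []
    for para in re.split(r'\n\s*\n', chunk):
        if current and ntok("\n\n".join(current + [para])) > limit:
            pieces.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces

def chunk_text(text, ntok):
    merged = merge_small(split_h2(text), ntok)
    return [piece for chunk in merged for piece in split_oversized(chunk, ntok)]
```

In this sketch a single paragraph longer than the limit is left intact, since there is no paragraph boundary to cut on. A thin section that sits between two full-size ones also stays thin here. Folding it into a larger neighbour would need a rule like the one described in the previous section.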

Every chunk gets a breadcrumb prefix prepended before embedding, so the model sees context like "Tokenization Strategies › Part 2 — Document tokenization" at the start of every chunk it embeds.

Chunk IDs follow the pattern {relative_path}##{section_heading}, giving stable anchors like concepts/SageMaker.md##Training jobs.

chunk_id = f"{rel_path}##{heading}"
breadcrumb = _breadcrumb(file_stem, h1, heading)
full_text = f"{breadcrumb}\n\n{text.strip()}"

The pipeline

Seven stages from frozen corpus to measured recall. Each is a single file in the repo, except pgvector, which is the store the winning vectors are loaded into.

freeze_corpus.py → chunker.py → schema.sql → bench_embedders.py → pgvector → test_retrieval.py → test_e2e_quality.py

Freeze. tools/freeze_corpus.py collects all markdown files, computes SHA-256 hashes, writes corpus-v1.json. The repo is tagged corpus-v1.

Chunk. ingest/chunker.py reads the manifest, chunks every file, writes ingest/chunks.jsonl. Each line is a JSON object with chunk text, breadcrumb, heading, source path, token count, and stable chunk ID.

Schema. ingest/schema.sql creates the pgvector schema — a chunks table with a vector(768) embedding column, an HNSW index for cosine similarity, and a search_chunks() function. All statements are idempotent.

CREATE TABLE IF NOT EXISTS chunks (
    chunk_id    TEXT PRIMARY KEY,
    text        TEXT NOT NULL,
    breadcrumb  TEXT NOT NULL,
    embedding   vector(768)
);

CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw ON chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

Benchmark and load. evals/bench_embedders.py runs all candidate models against an eval set of 25 hand-crafted question-to-chunk pairs. For each model it embeds all 673 chunks, embeds all 25 questions, and computes recall@k. It then ranks the models by recall@5, breaking ties on recall@10 and then recall@1, and writes the winner's cached vectors straight to pgvector.

winner = max(results, key=lambda r: (
    r["recall"].get("@5", 0),
    r["recall"].get("@10", 0),
    r["recall"].get("@1", 0),
))
write_winner_to_pgvector(winner, chunks, manifest, db_url)

Measure. Two scripts measure quality independently: evals/test_retrieval.py for recall@k and MRR in isolation, and evals/test_e2e_quality.py for end-to-end answer quality using an LLM judge.
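For reference, a minimal sketch of the two retrieval metrics, assuming one gold chunk per question as in the eval set described above. The function names and the toy data are illustrative, not taken from evals/test_retrieval.py.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of questions whose gold chunk appears in the top k results.

    ranked_ids[i] is the ranked chunk-ID list returned for question i;
    gold_ids[i] is the single expected chunk ID for that question.
    """
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

def mean_reciprocal_rank(ranked_ids, gold_ids):
    """Average of 1/rank of the gold chunk; a miss contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

# Toy data: four questions, three results each.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
gold = ["a", "e", "x", "l"]

print(recall_at_k(ranked, gold, 1))  # 0.25
print(recall_at_k(ranked, gold, 3))  # 0.75
print(round(mean_reciprocal_rank(ranked, gold), 4))  # 0.4583
```

Recall@k only asks whether the gold chunk made the cut, while MRR also rewards placing it higher, which is why the two can disagree about which model is better.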

Embedder results

Model	dim	r@1	r@3	r@5	r@10
BAAI/bge-small-en-v1.5	384	0.760	0.960	1.000	1.000
BAAI/bge-base-en-v1.5	768	0.800	0.960	1.000	1.000
nomic-ai/nomic-embed-text-v1.5	768	0.880	0.960	1.000	1.000

All three models hit recall@5 = 1.000 — every eval question found its correct chunk in the top 5 results. The differentiation is entirely at r@1: how often the single top result is the right one.

Winner: Nomic (nomic-ai/nomic-embed-text-v1.5), with recall@1 = 0.880 — 22 out of 25 questions had the correct chunk ranked first.
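Plugging the table's figures into the same tuple ordering shows how the tie resolves and what the r@1 scores mean in whole questions. The dict layout follows the selection snippet earlier in the post; the "model" key is an assumed field name.

```python
EVAL_SIZE = 25  # number of question-to-chunk pairs in the eval set

results = [
    {"model": "BAAI/bge-small-en-v1.5",
     "recall": {"@1": 0.760, "@3": 0.960, "@5": 1.000, "@10": 1.000}},
    {"model": "BAAI/bge-base-en-v1.5",
     "recall": {"@1": 0.800, "@3": 0.960, "@5": 1.000, "@10": 1.000}},
    {"model": "nomic-ai/nomic-embed-text-v1.5",
     "recall": {"@1": 0.880, "@3": 0.960, "@5": 1.000, "@10": 1.000}},
]

# Tuples compare left to right, so r@1 only matters once r@5 and r@10 tie.
winner = max(results, key=lambda r: (
    r["recall"].get("@5", 0),
    r["recall"].get("@10", 0),
    r["recall"].get("@1", 0),
))
print(winner["model"])  # nomic-ai/nomic-embed-text-v1.5

# Convert each r@1 rate back into a count of questions answered at rank 1.
top1_hits = {r["model"]: round(r["recall"]["@1"] * EVAL_SIZE) for r in results}
print(sorted(top1_hits.values()))  # [19, 20, 22]
```

With 25 questions each one is worth 0.04 of recall, so the whole spread between the weakest and strongest model at r@1 comes down to three questions.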

The verdict

The thesis held. A clean single-domain index retrieves perfectly at recall@5 without domain-clustering, reranking, or any of the complexity I assumed I'd need. The clustering idea was a problem I scoped away rather than one I had to engineer around.

The full run:

docker compose up -d
python3 tools/freeze_corpus.py wiki/ --tag corpus-v1
python3 ingest/chunker.py
psql "$DATABASE_URL" -f ingest/schema.sql
python3 evals/bench_embedders.py --write-to-db
python3 evals/test_retrieval.py --mode pgvector
python3 evals/test_e2e_quality.py --retrieval-mode pgvector

What's next

With only 25 eval questions, the gap between r@1 = 0.760 and r@1 = 0.880 is just 3 questions. The natural next steps are expanding the eval set from 25 to 500+ using LLM-generated drafts, measuring end-to-end answer quality with the LLM judge, and logging the final numbers as a decision record. If recall holds on the larger eval set, the baseline becomes production.