[{"data":1,"prerenderedAt":1836},["ShallowReactive",2],{"blog-list":3},[4,1664],{"id":5,"title":6,"body":7,"cover":1650,"date":1651,"description":1652,"draft":1653,"extension":1654,"meta":1655,"navigation":436,"path":1656,"seo":1657,"stem":1658,"tags":1659,"__hash__":1663},"blog\u002Fblog\u002Fmerging-my-wiki.md","Merging my wiki, then building the baseline",{"type":8,"value":9,"toc":1636},"minimark",[10,20,25,28,39,43,61,64,337,340,481,690,730,733,737,740,759,772,786,790,857,860,864,867,870,873,876,879,882,886,889,892,896,903,971,974,978,985,1006,1013,1023,1098,1102,1105,1111,1127,1140,1161,1297,1306,1398,1412,1416,1505,1508,1516,1520,1523,1526,1625,1629,1632],[11,12,13,14,19],"p",{},"This picks up where the ",[15,16,18],"a",{"href":17},"\u002Fblog\u002Fmeasuring-my-wiki","previous post"," ended. I had 95 concept-and-entity files, 801 sections, and 190 of them under the 64-token floor. The plan was to merge the short ones.",[21,22,24],"h2",{"id":23},"the-corpus-moved-while-i-was-working","The corpus moved while I was working",[11,26,27],{},"By the time I ran the merge, the wiki had grown. 95 files had become 109, and the section count had climbed from 801 to 954, with the undersized tail holding steady at 21%. All the numbers below are measured against that updated baseline.",[29,30,35],"pre",{"className":31,"code":33,"language":34},[32],"language-text","FILE-level:    n=109  max=4195  p95=2464  median=966  mean=1093\nSECTION-level: n=954  max=832  p95=328  median=98  mean=124\n  oversized  (> 512): 12 (1%)  -> split these\n  undersized (\u003C 64): 208 (21%)  -> merge candidates\n  sections\u002Ffile: median=9  max=15\n","text",[36,37,33],"code",{"__ignoreMap":38},"",[21,40,42],{"id":41},"a-bug-in-the-measurement-script","A bug in the measurement script",[11,44,45,46,49,50,53,54,57,58,60],{},"Before I could trust any of the counts, I needed to fix the script. 
The header regex was matching ",[36,47,48],{},"#"," anywhere in the file, including inside fenced code blocks. A bash comment like ",[36,51,52],{},"# Start server"," or a YAML snippet was being read as a heading and spun into a phantom section of five to eight tokens. Frontmatter had a related problem: the ",[36,55,56],{},"--- ... ---"," block had no ",[36,59,48],{}," heading of its own, so it became a preamble chunk of 50 to 63 tokens, pure metadata that nothing would ever retrieve.",[11,62,63],{},"The original per-file loop:",[29,65,69],{"className":66,"code":67,"language":68,"meta":38,"style":38},"language-python shiki shiki-themes github-light","file_tokens, sections, sections_per_file = [], [], []\nfor f in files:\n    text = f.read_text(encoding=\"utf-8\", errors=\"ignore\")\n    file_tokens.append(ntok(text))\n    idxs = [m.start() for m in header_re.finditer(text)]\n    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]\n    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]\n    secs = [s for s in secs if s.strip()]\n    sections_per_file.append(len(secs))\n    sections += [(ntok(s), f) for s in secs]\n","python",[36,70,71,87,102,137,143,164,235,280,305,316],{"__ignoreMap":38},[72,73,76,80,84],"span",{"class":74,"line":75},"line",1,[72,77,79],{"class":78},"sgsFI","file_tokens, sections, sections_per_file ",[72,81,83],{"class":82},"sD7c4","=",[72,85,86],{"class":78}," [], [], []\n",[72,88,90,93,96,99],{"class":74,"line":89},2,[72,91,92],{"class":82},"for",[72,94,95],{"class":78}," f ",[72,97,98],{"class":82},"in",[72,100,101],{"class":78}," files:\n",[72,103,105,108,110,113,117,119,123,126,129,131,134],{"class":74,"line":104},3,[72,106,107],{"class":78},"    text ",[72,109,83],{"class":82},[72,111,112],{"class":78}," f.read_text(",[72,114,116],{"class":115},"sqxcx","encoding",[72,118,83],{"class":82},[72,120,122],{"class":121},"sYBdl","\"utf-8\"",[72,124,125],{"class":78},", 
",[72,127,128],{"class":115},"errors",[72,130,83],{"class":82},[72,132,133],{"class":121},"\"ignore\"",[72,135,136],{"class":78},")\n",[72,138,140],{"class":74,"line":139},4,[72,141,142],{"class":78},"    file_tokens.append(ntok(text))\n",[72,144,146,149,151,154,156,159,161],{"class":74,"line":145},5,[72,147,148],{"class":78},"    idxs ",[72,150,83],{"class":82},[72,152,153],{"class":78}," [m.start() ",[72,155,92],{"class":82},[72,157,158],{"class":78}," m ",[72,160,98],{"class":82},[72,162,163],{"class":78}," header_re.finditer(text)]\n",[72,165,167,170,172,175,179,182,185,188,191,194,197,200,202,204,207,210,213,216,219,222,224,226,229,232],{"class":74,"line":166},6,[72,168,169],{"class":78},"    bounds ",[72,171,83],{"class":82},[72,173,174],{"class":78}," ([",[72,176,178],{"class":177},"sYu0t","0",[72,180,181],{"class":78},"] ",[72,183,184],{"class":82},"if",[72,186,187],{"class":78}," (",[72,189,190],{"class":82},"not",[72,192,193],{"class":78}," idxs ",[72,195,196],{"class":82},"or",[72,198,199],{"class":78}," idxs[",[72,201,178],{"class":177},[72,203,181],{"class":78},[72,205,206],{"class":82},">",[72,208,209],{"class":177}," 0",[72,211,212],{"class":78},") ",[72,214,215],{"class":82},"else",[72,217,218],{"class":78}," []) ",[72,220,221],{"class":82},"+",[72,223,193],{"class":78},[72,225,221],{"class":82},[72,227,228],{"class":78}," [",[72,230,231],{"class":177},"len",[72,233,234],{"class":78},"(text)]\n",[72,236,238,241,243,246,248,251,254,256,259,261,264,267,269,272,275,277],{"class":74,"line":237},7,[72,239,240],{"class":78},"    secs ",[72,242,83],{"class":82},[72,244,245],{"class":78}," [text[bounds[i]:bounds[i",[72,247,221],{"class":82},[72,249,250],{"class":177},"1",[72,252,253],{"class":78},"]] ",[72,255,92],{"class":82},[72,257,258],{"class":78}," i ",[72,260,98],{"class":82},[72,262,263],{"class":177}," 
range",[72,265,266],{"class":78},"(",[72,268,231],{"class":177},[72,270,271],{"class":78},"(bounds)",[72,273,274],{"class":82},"-",[72,276,250],{"class":177},[72,278,279],{"class":78},")]\n",[72,281,283,285,287,290,292,295,297,300,302],{"class":74,"line":282},8,[72,284,240],{"class":78},[72,286,83],{"class":82},[72,288,289],{"class":78}," [s ",[72,291,92],{"class":82},[72,293,294],{"class":78}," s ",[72,296,98],{"class":82},[72,298,299],{"class":78}," secs ",[72,301,184],{"class":82},[72,303,304],{"class":78}," s.strip()]\n",[72,306,308,311,313],{"class":74,"line":307},9,[72,309,310],{"class":78},"    sections_per_file.append(",[72,312,231],{"class":177},[72,314,315],{"class":78},"(secs))\n",[72,317,319,322,325,328,330,332,334],{"class":74,"line":318},10,[72,320,321],{"class":78},"    sections ",[72,323,324],{"class":82},"+=",[72,326,327],{"class":78}," [(ntok(s), f) ",[72,329,92],{"class":82},[72,331,294],{"class":78},[72,333,98],{"class":82},[72,335,336],{"class":78}," secs]\n",[11,338,339],{},"The fix adds two helper regexes and splits which string gets scanned from which gets sliced:",[29,341,343],{"className":66,"code":342,"language":68,"meta":38,"style":38},"# Replace fenced code blocks with spaces of the same length so '#' inside\n# bash\u002Fyaml\u002Fetc. snippets are not mistaken for headings, while character\n# positions and token counts stay accurate.\nfence_re = re.compile(r'```.*?```', re.S)\n# Strip YAML frontmatter so it doesn't become a preamble chunk.\nfrontmatter_re = re.compile(r'\\A---\\n.*?\\n---\\n', re.S)\n\ndef _blank_code(text):\n    return fence_re.sub(lambda m: ' ' * len(m.group(0)), text)\n",[36,344,345,351,356,361,389,394,432,438,450],{"__ignoreMap":38},[72,346,347],{"class":74,"line":75},[72,348,350],{"class":349},"sAwPA","# Replace fenced code blocks with spaces of the same length so '#' inside\n",[72,352,353],{"class":74,"line":89},[72,354,355],{"class":349},"# bash\u002Fyaml\u002Fetc. 
snippets are not mistaken for headings, while character\n",[72,357,358],{"class":74,"line":104},[72,359,360],{"class":349},"# positions and token counts stay accurate.\n",[72,362,363,366,368,371,374,377,380,383,386],{"class":74,"line":139},[72,364,365],{"class":78},"fence_re ",[72,367,83],{"class":82},[72,369,370],{"class":78}," re.compile(",[72,372,373],{"class":82},"r",[72,375,376],{"class":121},"'```",[72,378,379],{"class":177},".",[72,381,382],{"class":82},"*?",[72,384,385],{"class":121},"```'",[72,387,388],{"class":78},", re.S)\n",[72,390,391],{"class":74,"line":145},[72,392,393],{"class":349},"# Strip YAML frontmatter so it doesn't become a preamble chunk.\n",[72,395,396,399,401,403,405,408,411,414,418,420,422,424,426,428,430],{"class":74,"line":166},[72,397,398],{"class":78},"frontmatter_re ",[72,400,83],{"class":82},[72,402,370],{"class":78},[72,404,373],{"class":82},[72,406,407],{"class":121},"'",[72,409,410],{"class":177},"\\A",[72,412,413],{"class":121},"---",[72,415,417],{"class":416},"s691h","\\n",[72,419,379],{"class":177},[72,421,382],{"class":82},[72,423,417],{"class":416},[72,425,413],{"class":121},[72,427,417],{"class":416},[72,429,407],{"class":121},[72,431,388],{"class":78},[72,433,434],{"class":74,"line":237},[72,435,437],{"emptyLinePlaceholder":436},true,"\n",[72,439,440,443,447],{"class":74,"line":282},[72,441,442],{"class":82},"def",[72,444,446],{"class":445},"s7eDp"," _blank_code",[72,448,449],{"class":78},"(text):\n",[72,451,452,455,458,461,464,467,470,473,476,478],{"class":74,"line":307},[72,453,454],{"class":82},"    return",[72,456,457],{"class":78}," fence_re.sub(",[72,459,460],{"class":82},"lambda",[72,462,463],{"class":78}," m: ",[72,465,466],{"class":121},"' '",[72,468,469],{"class":82}," *",[72,471,472],{"class":177}," len",[72,474,475],{"class":78},"(m.group(",[72,477,178],{"class":177},[72,479,480],{"class":78},")), text)\n",[29,482,484],{"className":66,"code":483,"language":68,"meta":38,"style":38},"    raw = 
f.read_text(encoding=\"utf-8\", errors=\"ignore\")\n    file_tokens.append(ntok(raw))\n    text = frontmatter_re.sub('', raw)\n    text_scan = _blank_code(text)   # code blocks blanked; positions preserved\n    idxs = [m.start() for m in header_re.finditer(text_scan)]\n    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]\n    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]\n    secs = [s for s in secs if s.strip()]\n    sections_per_file.append(len(secs))\n    sections += [(ntok(s), f, _heading(s)) for s in secs]\n",[36,485,486,511,516,531,544,561,611,645,665,673],{"__ignoreMap":38},[72,487,488,491,493,495,497,499,501,503,505,507,509],{"class":74,"line":75},[72,489,490],{"class":78},"    raw ",[72,492,83],{"class":82},[72,494,112],{"class":78},[72,496,116],{"class":115},[72,498,83],{"class":82},[72,500,122],{"class":121},[72,502,125],{"class":78},[72,504,128],{"class":115},[72,506,83],{"class":82},[72,508,133],{"class":121},[72,510,136],{"class":78},[72,512,513],{"class":74,"line":89},[72,514,515],{"class":78},"    file_tokens.append(ntok(raw))\n",[72,517,518,520,522,525,528],{"class":74,"line":104},[72,519,107],{"class":78},[72,521,83],{"class":82},[72,523,524],{"class":78}," frontmatter_re.sub(",[72,526,527],{"class":121},"''",[72,529,530],{"class":78},", raw)\n",[72,532,533,536,538,541],{"class":74,"line":139},[72,534,535],{"class":78},"    text_scan ",[72,537,83],{"class":82},[72,539,540],{"class":78}," _blank_code(text)   ",[72,542,543],{"class":349},"# code blocks blanked; positions preserved\n",[72,545,546,548,550,552,554,556,558],{"class":74,"line":145},[72,547,148],{"class":78},[72,549,83],{"class":82},[72,551,153],{"class":78},[72,553,92],{"class":82},[72,555,158],{"class":78},[72,557,98],{"class":82},[72,559,560],{"class":78}," 
header_re.finditer(text_scan)]\n",[72,562,563,565,567,569,571,573,575,577,579,581,583,585,587,589,591,593,595,597,599,601,603,605,607,609],{"class":74,"line":166},[72,564,169],{"class":78},[72,566,83],{"class":82},[72,568,174],{"class":78},[72,570,178],{"class":177},[72,572,181],{"class":78},[72,574,184],{"class":82},[72,576,187],{"class":78},[72,578,190],{"class":82},[72,580,193],{"class":78},[72,582,196],{"class":82},[72,584,199],{"class":78},[72,586,178],{"class":177},[72,588,181],{"class":78},[72,590,206],{"class":82},[72,592,209],{"class":177},[72,594,212],{"class":78},[72,596,215],{"class":82},[72,598,218],{"class":78},[72,600,221],{"class":82},[72,602,193],{"class":78},[72,604,221],{"class":82},[72,606,228],{"class":78},[72,608,231],{"class":177},[72,610,234],{"class":78},[72,612,613,615,617,619,621,623,625,627,629,631,633,635,637,639,641,643],{"class":74,"line":237},[72,614,240],{"class":78},[72,616,83],{"class":82},[72,618,245],{"class":78},[72,620,221],{"class":82},[72,622,250],{"class":177},[72,624,253],{"class":78},[72,626,92],{"class":82},[72,628,258],{"class":78},[72,630,98],{"class":82},[72,632,263],{"class":177},[72,634,266],{"class":78},[72,636,231],{"class":177},[72,638,271],{"class":78},[72,640,274],{"class":82},[72,642,250],{"class":177},[72,644,279],{"class":78},[72,646,647,649,651,653,655,657,659,661,663],{"class":74,"line":282},[72,648,240],{"class":78},[72,650,83],{"class":82},[72,652,289],{"class":78},[72,654,92],{"class":82},[72,656,294],{"class":78},[72,658,98],{"class":82},[72,660,299],{"class":78},[72,662,184],{"class":82},[72,664,304],{"class":78},[72,666,667,669,671],{"class":74,"line":307},[72,668,310],{"class":78},[72,670,231],{"class":177},[72,672,315],{"class":78},[72,674,675,677,679,682,684,686,688],{"class":74,"line":318},[72,676,321],{"class":78},[72,678,324],{"class":82},[72,680,681],{"class":78}," [(ntok(s), f, _heading(s)) 
",[72,683,92],{"class":82},[72,685,294],{"class":78},[72,687,98],{"class":82},[72,689,336],{"class":78},[11,691,692,693,695,696,699,700,703,704,707,708,710,711,713,714,717,718,721,722,725,726,729],{},"Three things happen here. Frontmatter is stripped from ",[36,694,34],{}," before chunking, but ",[36,697,698],{},"file_tokens"," is still counted from ",[36,701,702],{},"raw",", so file-level totals stay honest. Code blocks are blanked to spaces of equal length rather than deleted: the header regex runs on ",[36,705,706],{},"text_scan",", but each section's content is sliced from the original ",[36,709,34],{},", so a ",[36,712,48],{}," inside a code block can't match as a heading while the section's token count still includes the code it contains. And ",[36,715,716],{},"bounds"," uses ",[36,719,720],{},"len(text)",", not ",[36,723,724],{},"len(text_scan)",", which is safe only because ",[36,727,728],{},"_blank_code"," preserves character count.",[11,731,732],{},"The net effect was about 15 phantom sections disappearing and all frontmatter preamble chunks ceasing to exist.",[21,734,736],{"id":735},"the-manual-merge-work","The manual merge work",[11,738,739],{},"With accurate counts in hand, I went through the undersized sections. Three patterns drove most of the reduction.",[11,741,742,746,747,750,751,754,755,758],{},[743,744,745],"strong",{},"Cross-reference lists."," Around a hundred files carried a ",[36,748,749],{},"## Cross-references"," or ",[36,752,753],{},"## Related pages"," heading over nothing but a list of wiki links. Structurally these are sections; semantically they are navigation. 
Collapsing them to an inline ",[36,756,757],{},"See also:"," line dissolved 102 sections without losing a single link.",[11,760,761,764,765,125,768,771],{},[743,762,763],{},"Design-pattern facets."," Several pages had given every facet its own heading: ",[36,766,767],{},"When to use",[36,769,770],{},"When NOT to use",", framework equivalents, and cross-references, five to seven thin sections on one page that together make one solid entry. Merging the use-and-avoid headings and folding the equivalents back inline pulled roughly 14 sections into their neighbours.",[11,773,774,777,778,781,782,785],{},[743,775,776],{},"Bare titles."," Many pages went straight from ",[36,779,780],{},"# Title"," into ",[36,783,784],{},"## Core idea"," with nothing between, leaving the title section as pure overhead. A sentence or two of lead-in pushed those above the floor and gave each page an opening line.",[21,787,789],{"id":788},"results","Results",[791,792,793,809],"table",{},[794,795,796],"thead",{},[797,798,799,803,806],"tr",{},[800,801,802],"th",{},"Metric",[800,804,805],{},"Before",[800,807,808],{},"After",[810,811,812,824,835,846],"tbody",{},[797,813,814,818,821],{},[815,816,817],"td",{},"Total sections",[815,819,820],{},"954",[815,822,823],{},"713 (-25%)",[797,825,826,829,832],{},[815,827,828],{},"Undersized (\u003C 64 tok)",[815,830,831],{},"208 (21%)",[815,833,834],{},"111 (15%)",[797,836,837,840,843],{},[815,838,839],{},"Median section size",[815,841,842],{},"98",[815,844,845],{},"126",[797,847,848,851,854],{},[815,849,850],{},"Sections\u002Ffile (median)",[815,852,853],{},"9",[815,855,856],{},"6",[11,858,859],{},"A quarter of the sections were structure, not knowledge.",[21,861,863],{"id":862},"whats-left-and-why-im-not-fixing-it-in-markdown","What's left, and why I'm not fixing it in markdown",[11,865,866],{},"111 sections are still under the floor. 
They split into two kinds.",[11,868,869],{},"About 40 are entity index cards: short reference pages where every section is naturally thin because the whole page is thin, 30 to 50 lines of fact. Padding them dilutes the signal. They are small by design.",[11,871,872],{},"The other 70 or so are concept titles that landed just under the line, 38 to 63 tokens even after a lead-in. Forcing more text means restating the first paragraph, which helps nothing.",[11,874,875],{},"The clean fix lives in the chunker. The rule is simple: if a section is under the threshold and another section follows it in the same file, concatenate the two before embedding. That folds thin concept titles into the body below them and treats a small reference card as one chunk instead of five thin ones. It is a small change in one place and it decouples how I write from how the text embeds, which is where that decision belongs.",[11,877,878],{},"The alternative is to accept that the 64-token floor was always a rule of thumb. At a median of 126 tokens the corpus is healthy, and a 48-token section is a perfectly good retrieval unit. Drop the floor and a chunk of the residual stops being a problem.",[11,880,881],{},"Either way the lesson holds. A quarter of my section count was structure rather than knowledge, and the cleanest fixes were about what counts as a chunk, not how cleverly I cut one.",[21,883,885],{"id":884},"building-the-baseline","Building the baseline",[11,887,888],{},"Past the merge step, the corpus is finally homogeneous. All the surviving files are concept entries written to the same schema. 
The thesis: a clean single-domain index over these entries should retrieve well on its own, without routing or clustering.",[11,890,891],{},"The plan: build the dumbest possible baseline, measure recall, and only add complexity if the numbers demand it.",[21,893,895],{"id":894},"the-frozen-corpus","The frozen corpus",[11,897,898,899,902],{},"The wiki kept growing during the merge work, so I froze a snapshot as ",[36,900,901],{},"corpus-v1",", a JSON manifest with SHA-256 hashes of every source file, treated as immutable for the experiment. Everything downstream points at this fixed snapshot.",[791,904,905,914],{},[794,906,907],{},[797,908,909,911],{},[800,910,802],{},[800,912,913],{},"Value",[810,915,916,924,932,940,948,956,964],{},[797,917,918,921],{},[815,919,920],{},"Source files",[815,922,923],{},"113",[797,925,926,929],{},[815,927,928],{},"Total chunks",[815,930,931],{},"673",[797,933,934,937],{},[815,935,936],{},"Median tokens per chunk",[815,938,939],{},"160",[797,941,942,945],{},[815,943,944],{},"p95 tokens",[815,946,947],{},"383",[797,949,950,953],{},[815,951,952],{},"Max tokens",[815,954,955],{},"568",[797,957,958,961],{},[815,959,960],{},"Oversized (>512 tokens)",[815,962,963],{},"6 (\u003C1%)",[797,965,966,969],{},[815,967,968],{},"Undersized (\u003C64 tokens)",[815,970,963],{},[11,972,973],{},"The undersized tail that drove the entire merge effort is gone. Six chunks on each side of the bounds is noise.",[21,975,977],{"id":976},"chunking-strategy","Chunking strategy",[11,979,980,981,984],{},"The chunker implements the merge-small-siblings rule described earlier. 
Chunks split on ",[36,982,983],{},"##"," (H2) headers, in three passes:",[986,987,988,995,1001],"ol",{},[989,990,991,994],"li",{},[743,992,993],{},"Split"," on H2 boundaries, so each section becomes a candidate chunk",[989,996,997,1000],{},[743,998,999],{},"Merge"," consecutive small siblings (\u003C64 tokens each) until the merged chunk reaches a 300-token target",[989,1002,1003,1005],{},[743,1004,993],{}," oversized sections (>512 tokens) by paragraph",[11,1007,1008,1009,1012],{},"Every chunk gets a breadcrumb prefix before embedding, so the model sees context like ",[36,1010,1011],{},"\"Tokenization Strategies › Part 2 — Document tokenization\""," at the start of the text it embeds.",[11,1014,1015,1016,1019,1020,379],{},"Chunk IDs follow the pattern ",[36,1017,1018],{},"{relative_path}##{section_heading}",", giving stable anchors like ",[36,1021,1022],{},"concepts\u002FSageMaker.md##Training jobs",[29,1024,1026],{"className":66,"code":1025,"language":68,"meta":38,"style":38},"chunk_id = f\"{rel_path}##{heading}\"\nbreadcrumb = _breadcrumb(file_stem, h1, heading)\nfull_text = f\"{breadcrumb}\\n\\n{text.strip()}\"\n",[36,1027,1028,1062,1072],{"__ignoreMap":38},[72,1029,1030,1033,1035,1038,1041,1044,1047,1050,1052,1054,1057,1059],{"class":74,"line":75},[72,1031,1032],{"class":78},"chunk_id ",[72,1034,83],{"class":82},[72,1036,1037],{"class":82}," f",[72,1039,1040],{"class":121},"\"",[72,1042,1043],{"class":177},"{",[72,1045,1046],{"class":78},"rel_path",[72,1048,1049],{"class":177},"}",[72,1051,983],{"class":121},[72,1053,1043],{"class":177},[72,1055,1056],{"class":78},"heading",[72,1058,1049],{"class":177},[72,1060,1061],{"class":121},"\"\n",[72,1063,1064,1067,1069],{"class":74,"line":89},[72,1065,1066],{"class":78},"breadcrumb ",[72,1068,83],{"class":82},[72,1070,1071],{"class":78}," _breadcrumb(file_stem, h1, heading)\n",[72,1073,1074,1077,1079,1081,1083,1085,1088,1091,1094,1096],{"class":74,"line":104},[72,1075,1076],{"class":78},"full_text 
",[72,1078,83],{"class":82},[72,1080,1037],{"class":82},[72,1082,1040],{"class":121},[72,1084,1043],{"class":177},[72,1086,1087],{"class":78},"breadcrumb",[72,1089,1090],{"class":177},"}\\n\\n{",[72,1092,1093],{"class":78},"text.strip()",[72,1095,1049],{"class":177},[72,1097,1061],{"class":121},[21,1099,1101],{"id":1100},"the-pipeline","The pipeline",[11,1103,1104],{},"Seven steps from frozen corpus to measured recall. Each is one script.",[29,1106,1109],{"className":1107,"code":1108,"language":34},[32],"freeze_corpus.py → chunker.py → schema.sql → bench_embedders.py → pgvector → test_retrieval.py → test_e2e_quality.py\n",[36,1110,1108],{"__ignoreMap":38},[11,1112,1113,1116,1117,1120,1121,1124,1125,379],{},[743,1114,1115],{},"Freeze."," ",[36,1118,1119],{},"tools\u002Ffreeze_corpus.py"," collects all markdown files, computes SHA-256 hashes, writes ",[36,1122,1123],{},"corpus-v1.json",". The repo is tagged ",[36,1126,901],{},[11,1128,1129,1116,1132,1135,1136,1139],{},[743,1130,1131],{},"Chunk.",[36,1133,1134],{},"ingest\u002Fchunker.py"," reads the manifest, chunks every file, writes ",[36,1137,1138],{},"ingest\u002Fchunks.jsonl",". Each line is a JSON object with chunk text, breadcrumb, heading, source path, token count, and stable chunk ID.",[11,1141,1142,1116,1145,1148,1149,1152,1153,1156,1157,1160],{},[743,1143,1144],{},"Schema.",[36,1146,1147],{},"ingest\u002Fschema.sql"," creates the pgvector schema — a ",[36,1150,1151],{},"chunks"," table with a ",[36,1154,1155],{},"vector(768)"," embedding column, an HNSW index for cosine similarity, and a ",[36,1158,1159],{},"search_chunks()"," function. 
All statements are idempotent.",[29,1162,1166],{"className":1163,"code":1164,"language":1165,"meta":38,"style":38},"language-sql shiki shiki-themes github-light","CREATE TABLE IF NOT EXISTS chunks (\n    chunk_id    TEXT PRIMARY KEY,\n    text        TEXT NOT NULL,\n    breadcrumb  TEXT NOT NULL,\n    embedding   vector(768)\n);\n\nCREATE INDEX IF NOT EXISTS chunks_embedding_hnsw ON chunks\n    USING hnsw (embedding vector_cosine_ops)\n    WITH (m = 16, ef_construction = 64);\n","sql",[36,1167,1168,1188,1202,1215,1226,1241,1246,1250,1266,1274],{"__ignoreMap":38},[72,1169,1170,1173,1176,1179,1182,1185],{"class":74,"line":75},[72,1171,1172],{"class":82},"CREATE",[72,1174,1175],{"class":82}," TABLE",[72,1177,1178],{"class":445}," IF",[72,1180,1181],{"class":82}," NOT",[72,1183,1184],{"class":82}," EXISTS",[72,1186,1187],{"class":78}," chunks (\n",[72,1189,1190,1193,1196,1199],{"class":74,"line":89},[72,1191,1192],{"class":78},"    chunk_id    ",[72,1194,1195],{"class":82},"TEXT",[72,1197,1198],{"class":82}," PRIMARY KEY",[72,1200,1201],{"class":78},",\n",[72,1203,1204,1207,1210,1213],{"class":74,"line":104},[72,1205,1206],{"class":82},"    text",[72,1208,1209],{"class":82},"        TEXT",[72,1211,1212],{"class":82}," NOT NULL",[72,1214,1201],{"class":78},[72,1216,1217,1220,1222,1224],{"class":74,"line":139},[72,1218,1219],{"class":78},"    breadcrumb  ",[72,1221,1195],{"class":82},[72,1223,1212],{"class":82},[72,1225,1201],{"class":78},[72,1227,1228,1231,1234,1236,1239],{"class":74,"line":145},[72,1229,1230],{"class":78},"    embedding   ",[72,1232,1233],{"class":82},"vector",[72,1235,266],{"class":78},[72,1237,1238],{"class":177},"768",[72,1240,136],{"class":78},[72,1242,1243],{"class":74,"line":166},[72,1244,1245],{"class":78},");\n",[72,1247,1248],{"class":74,"line":237},[72,1249,437],{"emptyLinePlaceholder":436},[72,1251,1252,1254,1257,1260,1263],{"class":74,"line":282},[72,1253,1172],{"class":82},[72,1255,1256],{"class":82}," INDEX IF NOT EXISTS",[72,1258,1259],{"class":445}," 
chunks_embedding_hnsw",[72,1261,1262],{"class":82}," ON",[72,1264,1265],{"class":78}," chunks\n",[72,1267,1268,1271],{"class":74,"line":307},[72,1269,1270],{"class":82},"    USING",[72,1272,1273],{"class":78}," hnsw (embedding vector_cosine_ops)\n",[72,1275,1276,1279,1282,1284,1287,1290,1292,1295],{"class":74,"line":318},[72,1277,1278],{"class":82},"    WITH",[72,1280,1281],{"class":78}," (m ",[72,1283,83],{"class":82},[72,1285,1286],{"class":177}," 16",[72,1288,1289],{"class":78},", ef_construction ",[72,1291,83],{"class":82},[72,1293,1294],{"class":177}," 64",[72,1296,1245],{"class":78},[11,1298,1299,1116,1302,1305],{},[743,1300,1301],{},"Benchmark and load.",[36,1303,1304],{},"evals\u002Fbench_embedders.py"," runs all candidate models against an eval set of 25 hand-crafted question-to-chunk pairs. For each model it embeds all 673 chunks, embeds all 25 questions, computes recall@k, picks the winner by recall@5 then recall@10 then recall@1 as tiebreaker, and writes the winner's cached vectors straight to pgvector.",[29,1307,1309],{"className":66,"code":1308,"language":68,"meta":38,"style":38},"winner = max(results, key=lambda r: (\n    r[\"recall\"].get(\"@5\", 0),\n    r[\"recall\"].get(\"@10\", 0),\n    r[\"recall\"].get(\"@1\", 0),\n))\nwrite_winner_to_pgvector(winner, chunks, manifest, db_url)\n",[36,1310,1311,1333,1354,1371,1388,1393],{"__ignoreMap":38},[72,1312,1313,1316,1318,1321,1324,1327,1330],{"class":74,"line":75},[72,1314,1315],{"class":78},"winner ",[72,1317,83],{"class":82},[72,1319,1320],{"class":177}," max",[72,1322,1323],{"class":78},"(results, ",[72,1325,1326],{"class":115},"key",[72,1328,1329],{"class":82},"=lambda",[72,1331,1332],{"class":78}," r: (\n",[72,1334,1335,1338,1341,1344,1347,1349,1351],{"class":74,"line":89},[72,1336,1337],{"class":78},"    
r[",[72,1339,1340],{"class":121},"\"recall\"",[72,1342,1343],{"class":78},"].get(",[72,1345,1346],{"class":121},"\"@5\"",[72,1348,125],{"class":78},[72,1350,178],{"class":177},[72,1352,1353],{"class":78},"),\n",[72,1355,1356,1358,1360,1362,1365,1367,1369],{"class":74,"line":104},[72,1357,1337],{"class":78},[72,1359,1340],{"class":121},[72,1361,1343],{"class":78},[72,1363,1364],{"class":121},"\"@10\"",[72,1366,125],{"class":78},[72,1368,178],{"class":177},[72,1370,1353],{"class":78},[72,1372,1373,1375,1377,1379,1382,1384,1386],{"class":74,"line":139},[72,1374,1337],{"class":78},[72,1376,1340],{"class":121},[72,1378,1343],{"class":78},[72,1380,1381],{"class":121},"\"@1\"",[72,1383,125],{"class":78},[72,1385,178],{"class":177},[72,1387,1353],{"class":78},[72,1389,1390],{"class":74,"line":145},[72,1391,1392],{"class":78},"))\n",[72,1394,1395],{"class":74,"line":166},[72,1396,1397],{"class":78},"write_winner_to_pgvector(winner, chunks, manifest, db_url)\n",[11,1399,1400,1403,1404,1407,1408,1411],{},[743,1401,1402],{},"Measure."," Two scripts measure quality independently: ",[36,1405,1406],{},"evals\u002Ftest_retrieval.py"," for recall@k and MRR in isolation, and ",[36,1409,1410],{},"evals\u002Ftest_e2e_quality.py"," for end-to-end answer quality using an LLM judge.",[21,1413,1415],{"id":1414},"embedder-results","Embedder 
results",[791,1417,1418,1440],{},[794,1419,1420],{},[797,1421,1422,1425,1428,1431,1434,1437],{},[800,1423,1424],{},"Model",[800,1426,1427],{},"dim",[800,1429,1430],{},"r@1",[800,1432,1433],{},"r@3",[800,1435,1436],{},"r@5",[800,1438,1439],{},"r@10",[810,1441,1442,1461,1477],{},[797,1443,1444,1447,1450,1453,1456,1459],{},[815,1445,1446],{},"BAAI\u002Fbge-small-en-v1.5",[815,1448,1449],{},"384",[815,1451,1452],{},"0.760",[815,1454,1455],{},"0.960",[815,1457,1458],{},"1.000",[815,1460,1458],{},[797,1462,1463,1466,1468,1471,1473,1475],{},[815,1464,1465],{},"BAAI\u002Fbge-base-en-v1.5",[815,1467,1238],{},[815,1469,1470],{},"0.800",[815,1472,1455],{},[815,1474,1458],{},[815,1476,1458],{},[797,1478,1479,1484,1488,1493,1497,1501],{},[815,1480,1481],{},[743,1482,1483],{},"nomic-ai\u002Fnomic-embed-text-v1.5",[815,1485,1486],{},[743,1487,1238],{},[815,1489,1490],{},[743,1491,1492],{},"0.880",[815,1494,1495],{},[743,1496,1455],{},[815,1498,1499],{},[743,1500,1458],{},[815,1502,1503],{},[743,1504,1458],{},[11,1506,1507],{},"All three models hit recall@5 = 1.000 — every eval question found its correct chunk in the top 5 results. The differentiation is entirely at r@1: how often the single top result is the right one.",[11,1509,1510,187,1513,1515],{},[743,1511,1512],{},"Winner: Nomic",[36,1514,1483],{},"), with recall@1 = 0.880 — 22 out of 25 questions had the correct chunk ranked first.",[21,1517,1519],{"id":1518},"the-verdict","The verdict",[11,1521,1522],{},"The thesis held. A clean single-domain index retrieves perfectly at recall@5 without domain-clustering, reranking, or any of the complexity I assumed I'd need. 
The clustering idea was a problem I scoped away rather than one I had to engineer around.",[11,1524,1525],{},"The full run:",[29,1527,1531],{"className":1528,"code":1529,"language":1530,"meta":38,"style":38},"language-bash shiki shiki-themes github-light","docker compose up -d\npython3 tools\u002Ffreeze_corpus.py wiki\u002F --tag corpus-v1\npython3 ingest\u002Fchunker.py\npsql \"$DATABASE_URL\" -f ingest\u002Fschema.sql\npython3 evals\u002Fbench_embedders.py --write-to-db\npython3 evals\u002Ftest_retrieval.py --mode pgvector\npython3 evals\u002Ftest_e2e_quality.py --retrieval-mode pgvector\n","bash",[36,1532,1533,1547,1564,1571,1590,1600,1613],{"__ignoreMap":38},[72,1534,1535,1538,1541,1544],{"class":74,"line":75},[72,1536,1537],{"class":445},"docker",[72,1539,1540],{"class":121}," compose",[72,1542,1543],{"class":121}," up",[72,1545,1546],{"class":177}," -d\n",[72,1548,1549,1552,1555,1558,1561],{"class":74,"line":89},[72,1550,1551],{"class":445},"python3",[72,1553,1554],{"class":121}," tools\u002Ffreeze_corpus.py",[72,1556,1557],{"class":121}," wiki\u002F",[72,1559,1560],{"class":177}," --tag",[72,1562,1563],{"class":121}," corpus-v1\n",[72,1565,1566,1568],{"class":74,"line":104},[72,1567,1551],{"class":445},[72,1569,1570],{"class":121}," ingest\u002Fchunker.py\n",[72,1572,1573,1576,1579,1582,1584,1587],{"class":74,"line":139},[72,1574,1575],{"class":445},"psql",[72,1577,1578],{"class":121}," \"",[72,1580,1581],{"class":78},"$DATABASE_URL",[72,1583,1040],{"class":121},[72,1585,1586],{"class":177}," -f",[72,1588,1589],{"class":121}," ingest\u002Fschema.sql\n",[72,1591,1592,1594,1597],{"class":74,"line":145},[72,1593,1551],{"class":445},[72,1595,1596],{"class":121}," evals\u002Fbench_embedders.py",[72,1598,1599],{"class":177}," --write-to-db\n",[72,1601,1602,1604,1607,1610],{"class":74,"line":166},[72,1603,1551],{"class":445},[72,1605,1606],{"class":121}," evals\u002Ftest_retrieval.py",[72,1608,1609],{"class":177}," --mode",[72,1611,1612],{"class":121}," 
pgvector\n",[72,1614,1615,1617,1620,1623],{"class":74,"line":237},[72,1616,1551],{"class":445},[72,1618,1619],{"class":121}," evals\u002Ftest_e2e_quality.py",[72,1621,1622],{"class":177}," --retrieval-mode",[72,1624,1612],{"class":121},[21,1626,1628],{"id":1627},"whats-next","What's next",[11,1630,1631],{},"With only 25 eval questions, the gap between r@1 = 0.760 and r@1 = 0.880 is literally 3 questions. The natural next step is expanding the eval set from 25 to 500+ using LLM-generated drafts, measuring end-to-end answer quality with the LLM judge, and logging the final numbers as a decision record. If recall holds on the larger eval set, the baseline becomes production.",[1633,1634,1635],"style",{},"html pre.shiki code .sgsFI, html code.shiki .sgsFI{--shiki-default:#24292E}html pre.shiki code .sD7c4, html code.shiki .sD7c4{--shiki-default:#D73A49}html pre.shiki code .sqxcx, html code.shiki .sqxcx{--shiki-default:#E36209}html pre.shiki code .sYBdl, html code.shiki .sYBdl{--shiki-default:#032F62}html pre.shiki code .sYu0t, html code.shiki .sYu0t{--shiki-default:#005CC5}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sAwPA, html code.shiki .sAwPA{--shiki-default:#6A737D}html pre.shiki code .s691h, html code.shiki .s691h{--shiki-default:#22863A;--shiki-default-font-weight:bold}html pre.shiki code .s7eDp, html code.shiki 
.s7eDp{--shiki-default:#6F42C1}",{"title":38,"searchDepth":89,"depth":89,"links":1637},[1638,1639,1640,1641,1642,1643,1644,1645,1646,1647,1648,1649],{"id":23,"depth":89,"text":24},{"id":41,"depth":89,"text":42},{"id":735,"depth":89,"text":736},{"id":788,"depth":89,"text":789},{"id":862,"depth":89,"text":863},{"id":884,"depth":89,"text":885},{"id":894,"depth":89,"text":895},{"id":976,"depth":89,"text":977},{"id":1100,"depth":89,"text":1101},{"id":1414,"depth":89,"text":1415},{"id":1518,"depth":89,"text":1519},{"id":1627,"depth":89,"text":1628},null,"2026-06-09","A 25% drop in section count, a measurement bug, a retrieval pipeline that didn't need the clever parts, and recall@5 = 1.000 on the first try.",false,"md",{},"\u002Fblog\u002Fmerging-my-wiki",{"title":6,"description":1652},"blog\u002Fmerging-my-wiki",[1660,1661,1662],"rag","embeddings","chunking","yQR9No1Mt66F93rUDsX49e04WNclBfoUDJDR13nLddU",{"id":1665,"title":1666,"body":1667,"cover":1650,"date":1829,"description":1830,"draft":1653,"extension":1654,"meta":1831,"navigation":436,"path":17,"seo":1832,"stem":1833,"tags":1834,"__hash__":1835},"blog\u002Fblog\u002Fmeasuring-my-wiki.md","How measuring my own wiki kept proving me wrong",{"type":8,"value":1668,"toc":1820},[1669,1673,1676,1686,1693,1696,1700,1703,1706,1710,1713,1716,1731,1737,1740,1744,1747,1750,1757,1763,1767,1770,1773,1776,1780,1789,1792,1798,1801,1812,1814,1817],[21,1670,1672],{"id":1671},"it-started-with-laziness","It started with laziness",[11,1674,1675],{},"I wanted to automate my interview prep. The idea was to build a knowledge base from my notes and use it to simulate interviews, surfacing the gaps I needed to close rather than redoing the same manual review before every application.",[11,1677,1678,1679,1685],{},"So I built a wiki. The shape came from ",[15,1680,1684],{"href":1681,"rel":1682},"https:\u002F\u002Fgist.github.com\u002Fkarpathy\u002F442a6bf555914893e9891c11519de94f",[1683],"nofollow","Karpathy's LLM Wiki gist",". 
Rather than re-derive knowledge from raw sources on every query, you let the LLM incrementally build and maintain a persistent, cross-linked wiki. Raw sources stay immutable. The LLM owns the wiki layer. A schema file tells it how to behave.",[11,1687,1688,1689,1692],{},"I extended that in two directions. Every agent needed the same orientation, so the shared instructions went into a top-level ",[36,1690,1691],{},"AGENTS.md"," that any agent reads on arrival. And because one agent doing everything was a mess, I split the work into seven single-purpose agents, each writing only where its mandate allows. Of those seven, LIBRARIAN is the one that matters for this story. It restructures notes, merges stubs, expands thin sections, and maintains cross-links. Every other agent reads what it maintains.",[11,1694,1695],{},"The wiki is the single source of truth. Agents compound on each other through it, never through private state.",[21,1697,1699],{"id":1698},"_250-pages-later-the-tokens-added-up","250 pages later, the tokens added up",[11,1701,1702],{},"At around 250 pages, \"give the agent what it needs to know\" had quietly become \"give the agent a big slice of the wiki,\" and the token bill showed it. Time for retrieval: store the wiki as vectors, pull back only the few chunks each query needs.",[11,1704,1705],{},"So I built a RAG, and immediately hit the question that turned out to be the whole story: what granularity should the chunks be?",[21,1707,1709],{"id":1708},"first-wrong-turn-thinking-the-files-were-already-my-chunks","First wrong turn: thinking the files were already my chunks",[11,1711,1712],{},"My notes have headers and sections, so I figured I had semantic chunks for free. One file, one chunk. Done.",[11,1714,1715],{},"Then I measured. My files average eight sections each, so one vector per file smears eight topics together, and a query about one of them drags back the whole file. Splitting on headers fixed that. 
Section-level it is.",[11,1717,1718,1719,1722,1723,1726,1727,1730],{},"The same measurement caught a different mistake before I made it. I had planned to find my biggest file and buy an embedder whose context window could swallow it. My biggest files were ",[36,1720,1721],{},"log.md"," at 14,501 tokens, ",[36,1724,1725],{},"backlog.md"," at 11,417, and ",[36,1728,1729],{},"index.md"," at 7,976. None of them are knowledge. They are plumbing. I nearly sized my whole embedder choice around a file that should not be in the corpus at all.",[29,1732,1735],{"className":1733,"code":1734,"language":34},[32],"ROOT: ~\u002Fwiki\nincluded files: 251   excluded: 0\ntotal tokens (content only): 310633\n\nFILE-level:    n=251  max=14501  p95=2484  median=972  mean=1237\nSECTION-level: n=1969  max=4685   p95=459   median=104  mean=157\n  oversized  (> 512): 72 (3%)   -> split candidates\n  undersized (\u003C 64):  483 (24%) -> merge candidates\n  sections\u002Ffile: median=8 max=52\n\nbiggest sections:\n  4685  wiki\u002Fbacklog.md\n  4479  wiki\u002Fbacklog.md\n  2933  wiki\u002Fdecisions\u002Fledger.md\n  2146  wiki\u002Fdebriefs\u002F2026-05-27-[company]-[person].md\n  1902  wiki\u002Findex.md\n",[36,1736,1734],{"__ignoreMap":38},[11,1738,1739],{},"Lesson: scope the corpus before you measure it, not after.",[21,1741,1743],{"id":1742},"the-corpus-not-the-chunk-size","The corpus, not the chunk size",[11,1745,1746],{},"So I dropped the plumbing and measured content only. The corpus itself drives every later decision.",[11,1748,1749],{},"Even without plumbing, my biggest sections were not the material I cared about. They were interview transcripts and debriefs. My wiki keeps very different kinds of writing side by side: interview prep, debriefs, AWS notes. All useful to me, all shaped differently, and one chunking rule across the lot is a compromise everywhere.",[11,1751,1752,1753,1756],{},"The tail surprised me too. I had braced for chunks too big for the embedder. 
The real problem was the reverse. 27% of my sections are under 64 tokens, too thin to retrieve well on their own. (The 64\u002F512 boundaries are rules of thumb for the class of embedders I'm evaluating, small models in the ",[36,1754,1755],{},"text-embedding-3-small"," range, where the sweet spot sits between roughly 50 and 512 tokens. I haven't locked a final model yet; the point of measuring first is to let the data inform that choice rather than the reverse.)",[29,1758,1761],{"className":1759,"code":1760,"language":34},[32],"ROOT: ~\u002Fwiki\nincluded files: 227   excluded: 24 (index.md x20, backlog.md, ledger.md,\n                                     2026-05-27-[company]-[person].md, log.md)\ntotal tokens (content only): 229689\n\nFILE-level:    n=227  max=4647  p95=2028  median=889  mean=1011\nSECTION-level: n=1686 max=1369  p95=417   median=96   mean=136\n  oversized  (> 512): 48 (2%)   -> split candidates\n  undersized (\u003C 64):  458 (27%) -> merge candidates\n  sections\u002Ffile: median=7 max=15\n\nbiggest sections:\n  1369  wiki\u002Finterviews\u002F2026-05-26-[company].md\n  1156  wiki\u002Finterviews\u002F2026-05-26-[company].md\n  1155  wiki\u002Finterviews\u002F2026-05-28-dry-run.md\n  1006  wiki\u002Fsources\u002F[company]\u002Fmotivation_raw.md\n   988  wiki\u002Finterviews\u002F2026-05-26-[company]-[role].md\n",[36,1762,1760],{"__ignoreMap":38},[21,1764,1766],{"id":1765},"what-this-taught-me","What this taught me",[11,1768,1769],{},"Some questions you look up, some you measure. \"Does this chunk fit the embedder?\" is a lookup, the same answer for everyone. \"What granularity, and which embedder, work on my notes?\" depends entirely on my data. My first plan treated a measure-question as a lookup.",[11,1771,1772],{},"Scope beats mechanics. Twice, \"what is my biggest chunk?\" resolved to a file that should not be indexed at all. 
What goes in the corpus moved my results far more than how I sliced it, and that is a judgment about what the system is for, not a number.",[11,1774,1775],{},"The leverage is in the source. Retrieval chunks are derived from how I write notes, and LIBRARIAN edits that same structure directly. Get the notes right, small atomic sections with clean headers, and both jobs get easier at once. The clever chunker I thought I needed is mostly downstream of how I write a single note.",[21,1777,1779],{"id":1778},"scoping-down-to-what-i-actually-retrieve","Scoping down to what I actually retrieve",[11,1781,1782,1783,1788],{},"Mid-puzzle, I found a well-kept external repo, ",[15,1784,1787],{"href":1785,"rel":1786},"https:\u002F\u002Fgithub.com\u002Fartreimus\u002Fnotes-aws-machine-learning",[1683],"artreimus\u002Fnotes-aws-machine-learning",", that already maps the full MLA-C01 syllabus and is actively maintained against current AWS docs. It has no licence, which rules out folding it into my wiki. That turned out to clarify things rather than block them.",[11,1790,1791],{},"Narrowing to gaps means most of what I might have indexed was never mine to index. I measured the concepts-and-entities slice with everything else excluded. 
95 files in, 156 out.",[29,1793,1796],{"className":1794,"code":1795,"language":34},[32],"ROOT: ~\u002Fwiki\nincluded files: 95   excluded: 156\ntotal tokens (content only): 92769\n\nFILE-level:    n=95   max=3205  p95=2028  median=879  mean=976\nSECTION-level: n=801  max=832   p95=298   median=91   mean=115\n  oversized  (> 512): 7  (0%)  -> split candidates\n  undersized (\u003C 64):  190 (23%) -> merge candidates\n  sections\u002Ffile: median=8  max=15\n\nbiggest sections:\n   832  wiki\u002Fconcepts\u002FAI Coding Agents in Enterprise.md\n   631  wiki\u002Fconcepts\u002FTokenization Strategies.md\n   619  wiki\u002Fconcepts\u002FLLMOps.md\n   593  wiki\u002Fconcepts\u002FVector Databases and RAG Architecture.md\n   590  wiki\u002Fconcepts\u002FTokenization Strategies.md\n",[36,1797,1795],{"__ignoreMap":38},[11,1799,1800],{},"The scoping paid off cleanly:",[986,1802,1803,1806,1809],{},[989,1804,1805],{},"Tokens fell from 229k to 93k, not from harder chunking but because 156 files never belonged here. More were excluded than included, most of them operational artefacts.",[989,1807,1808],{},"The oversized problem nearly vanished: 48 sections down to 7, the largest a comfortable 832 tokens. The files driving my chunk-size anxiety were transcripts and application notes, not the concept entries I actually retrieve.",[989,1810,1811],{},"The undersized tail held at 23%. That one doesn't bow to scoping. It lives in the concept notes, short stubs that need merging. Mechanical, bounded, and mine to do.",[21,1813,1628],{"id":1627},[11,1815,1816],{},"The merge step. 190 sections under 64 tokens need a look. Some merge cleanly, some are stubs to expand, a few are fine as they are.",[11,1818,1819],{},"After that, the corpus is finally homogeneous. All 95 files are concept entries written to the same schema. The question is whether a clean single-domain index over these entries retrieves well on its own, without routing or clustering. My prediction: it will. 
If the next round of measurements proves me wrong, that will be interesting too.",{"title":38,"searchDepth":89,"depth":89,"links":1821},[1822,1823,1824,1825,1826,1827,1828],{"id":1671,"depth":89,"text":1672},{"id":1698,"depth":89,"text":1699},{"id":1708,"depth":89,"text":1709},{"id":1742,"depth":89,"text":1743},{"id":1765,"depth":89,"text":1766},{"id":1778,"depth":89,"text":1779},{"id":1627,"depth":89,"text":1628},"2026-06-05","Building a RAG over 250 pages of interview notes, and watching my assumptions die one measurement at a time.",{},{"title":1666,"description":1830},"blog\u002Fmeasuring-my-wiki",[1660,1661,1662],"TnDG6-358fFN-V4RwhsaMl10jgKgiy2EnRdQ-KWVxG8",1781020791121]