[{"data":1,"prerenderedAt":1663},["ShallowReactive",2],{"post-\u002Fblog\u002Fmerging-my-wiki":3},{"id":4,"title":5,"body":6,"cover":1649,"date":1650,"description":1651,"draft":1652,"extension":1653,"meta":1654,"navigation":435,"path":1655,"seo":1656,"stem":1657,"tags":1658,"__hash__":1662},"blog\u002Fblog\u002Fmerging-my-wiki.md","Merging my wiki, then building the baseline",{"type":7,"value":8,"toc":1635},"minimark",[9,19,24,27,38,42,60,63,336,339,480,689,729,732,736,739,758,771,785,789,856,859,863,866,869,872,875,878,881,885,888,891,895,902,970,973,977,984,1005,1012,1022,1097,1101,1104,1110,1126,1139,1160,1296,1305,1397,1411,1415,1504,1507,1515,1519,1522,1525,1624,1628,1631],[10,11,12,13,18],"p",{},"This picks up where the ",[14,15,17],"a",{"href":16},"\u002Fblog\u002Fmeasuring-my-wiki","previous post"," ended. I had 95 concept-and-entity files, 801 sections, and 190 of them under the 64-token floor. The plan was to merge the short ones.",[20,21,23],"h2",{"id":22},"the-corpus-moved-while-i-was-working","The corpus moved while I was working",[10,25,26],{},"By the time I ran the merge, the wiki had grown. 95 files had become 109, and the section count had climbed from 801 to 954, with the undersized tail holding steady at 21%. All the numbers below are measured against that updated baseline.",[28,29,34],"pre",{"className":30,"code":32,"language":33},[31],"language-text","FILE-level:    n=109  max=4195  p95=2464  median=966  mean=1093\nSECTION-level: n=954  max=832  p95=328  median=98  mean=124\n  oversized  (> 512): 12 (1%)  -> split these\n  undersized (\u003C 64): 208 (21%)  -> merge candidates\n  sections\u002Ffile: median=9  max=15\n","text",[35,36,32],"code",{"__ignoreMap":37},"",[20,39,41],{"id":40},"a-bug-in-the-measurement-script","A bug in the measurement script",[10,43,44,45,48,49,52,53,56,57,59],{},"Before I could trust any of the counts, I needed to fix the script. 
The header regex was matching ",[35,46,47],{},"#"," anywhere in the file, including inside fenced code blocks. A bash comment like ",[35,50,51],{},"# Start server"," or a YAML snippet was being read as a heading and spun into a phantom section of five to eight tokens. Frontmatter had a related problem: the ",[35,54,55],{},"--- ... ---"," block had no ",[35,58,47],{}," heading of its own, so it became a preamble chunk of 50 to 63 tokens, pure metadata that nothing would ever retrieve.",[10,61,62],{},"The original per-file loop:",[28,64,68],{"className":65,"code":66,"language":67,"meta":37,"style":37},"language-python shiki shiki-themes github-light","file_tokens, sections, sections_per_file = [], [], []\nfor f in files:\n    text = f.read_text(encoding=\"utf-8\", errors=\"ignore\")\n    file_tokens.append(ntok(text))\n    idxs = [m.start() for m in header_re.finditer(text)]\n    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]\n    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]\n    secs = [s for s in secs if s.strip()]\n    sections_per_file.append(len(secs))\n    sections += [(ntok(s), f) for s in secs]\n","python",[35,69,70,86,101,136,142,163,234,279,304,315],{"__ignoreMap":37},[71,72,75,79,83],"span",{"class":73,"line":74},"line",1,[71,76,78],{"class":77},"sgsFI","file_tokens, sections, sections_per_file ",[71,80,82],{"class":81},"sD7c4","=",[71,84,85],{"class":77}," [], [], []\n",[71,87,89,92,95,98],{"class":73,"line":88},2,[71,90,91],{"class":81},"for",[71,93,94],{"class":77}," f ",[71,96,97],{"class":81},"in",[71,99,100],{"class":77}," files:\n",[71,102,104,107,109,112,116,118,122,125,128,130,133],{"class":73,"line":103},3,[71,105,106],{"class":77},"    text ",[71,108,82],{"class":81},[71,110,111],{"class":77}," f.read_text(",[71,113,115],{"class":114},"sqxcx","encoding",[71,117,82],{"class":81},[71,119,121],{"class":120},"sYBdl","\"utf-8\"",[71,123,124],{"class":77},", 
",[71,126,127],{"class":114},"errors",[71,129,82],{"class":81},[71,131,132],{"class":120},"\"ignore\"",[71,134,135],{"class":77},")\n",[71,137,139],{"class":73,"line":138},4,[71,140,141],{"class":77},"    file_tokens.append(ntok(text))\n",[71,143,145,148,150,153,155,158,160],{"class":73,"line":144},5,[71,146,147],{"class":77},"    idxs ",[71,149,82],{"class":81},[71,151,152],{"class":77}," [m.start() ",[71,154,91],{"class":81},[71,156,157],{"class":77}," m ",[71,159,97],{"class":81},[71,161,162],{"class":77}," header_re.finditer(text)]\n",[71,164,166,169,171,174,178,181,184,187,190,193,196,199,201,203,206,209,212,215,218,221,223,225,228,231],{"class":73,"line":165},6,[71,167,168],{"class":77},"    bounds ",[71,170,82],{"class":81},[71,172,173],{"class":77}," ([",[71,175,177],{"class":176},"sYu0t","0",[71,179,180],{"class":77},"] ",[71,182,183],{"class":81},"if",[71,185,186],{"class":77}," (",[71,188,189],{"class":81},"not",[71,191,192],{"class":77}," idxs ",[71,194,195],{"class":81},"or",[71,197,198],{"class":77}," idxs[",[71,200,177],{"class":176},[71,202,180],{"class":77},[71,204,205],{"class":81},">",[71,207,208],{"class":176}," 0",[71,210,211],{"class":77},") ",[71,213,214],{"class":81},"else",[71,216,217],{"class":77}," []) ",[71,219,220],{"class":81},"+",[71,222,192],{"class":77},[71,224,220],{"class":81},[71,226,227],{"class":77}," [",[71,229,230],{"class":176},"len",[71,232,233],{"class":77},"(text)]\n",[71,235,237,240,242,245,247,250,253,255,258,260,263,266,268,271,274,276],{"class":73,"line":236},7,[71,238,239],{"class":77},"    secs ",[71,241,82],{"class":81},[71,243,244],{"class":77}," [text[bounds[i]:bounds[i",[71,246,220],{"class":81},[71,248,249],{"class":176},"1",[71,251,252],{"class":77},"]] ",[71,254,91],{"class":81},[71,256,257],{"class":77}," i ",[71,259,97],{"class":81},[71,261,262],{"class":176}," 
range",[71,264,265],{"class":77},"(",[71,267,230],{"class":176},[71,269,270],{"class":77},"(bounds)",[71,272,273],{"class":81},"-",[71,275,249],{"class":176},[71,277,278],{"class":77},")]\n",[71,280,282,284,286,289,291,294,296,299,301],{"class":73,"line":281},8,[71,283,239],{"class":77},[71,285,82],{"class":81},[71,287,288],{"class":77}," [s ",[71,290,91],{"class":81},[71,292,293],{"class":77}," s ",[71,295,97],{"class":81},[71,297,298],{"class":77}," secs ",[71,300,183],{"class":81},[71,302,303],{"class":77}," s.strip()]\n",[71,305,307,310,312],{"class":73,"line":306},9,[71,308,309],{"class":77},"    sections_per_file.append(",[71,311,230],{"class":176},[71,313,314],{"class":77},"(secs))\n",[71,316,318,321,324,327,329,331,333],{"class":73,"line":317},10,[71,319,320],{"class":77},"    sections ",[71,322,323],{"class":81},"+=",[71,325,326],{"class":77}," [(ntok(s), f) ",[71,328,91],{"class":81},[71,330,293],{"class":77},[71,332,97],{"class":81},[71,334,335],{"class":77}," secs]\n",[10,337,338],{},"The fix adds two helper regexes and splits which string gets scanned from which gets sliced:",[28,340,342],{"className":65,"code":341,"language":67,"meta":37,"style":37},"# Replace fenced code blocks with spaces of the same length so '#' inside\n# bash\u002Fyaml\u002Fetc. snippets are not mistaken for headings, while character\n# positions and token counts stay accurate.\nfence_re = re.compile(r'```.*?```', re.S)\n# Strip YAML frontmatter so it doesn't become a preamble chunk.\nfrontmatter_re = re.compile(r'\\A---\\n.*?\\n---\\n', re.S)\n\ndef _blank_code(text):\n    return fence_re.sub(lambda m: ' ' * len(m.group(0)), text)\n",[35,343,344,350,355,360,388,393,431,437,449],{"__ignoreMap":37},[71,345,346],{"class":73,"line":74},[71,347,349],{"class":348},"sAwPA","# Replace fenced code blocks with spaces of the same length so '#' inside\n",[71,351,352],{"class":73,"line":88},[71,353,354],{"class":348},"# bash\u002Fyaml\u002Fetc. 
snippets are not mistaken for headings, while character\n",[71,356,357],{"class":73,"line":103},[71,358,359],{"class":348},"# positions and token counts stay accurate.\n",[71,361,362,365,367,370,373,376,379,382,385],{"class":73,"line":138},[71,363,364],{"class":77},"fence_re ",[71,366,82],{"class":81},[71,368,369],{"class":77}," re.compile(",[71,371,372],{"class":81},"r",[71,374,375],{"class":120},"'```",[71,377,378],{"class":176},".",[71,380,381],{"class":81},"*?",[71,383,384],{"class":120},"```'",[71,386,387],{"class":77},", re.S)\n",[71,389,390],{"class":73,"line":144},[71,391,392],{"class":348},"# Strip YAML frontmatter so it doesn't become a preamble chunk.\n",[71,394,395,398,400,402,404,407,410,413,417,419,421,423,425,427,429],{"class":73,"line":165},[71,396,397],{"class":77},"frontmatter_re ",[71,399,82],{"class":81},[71,401,369],{"class":77},[71,403,372],{"class":81},[71,405,406],{"class":120},"'",[71,408,409],{"class":176},"\\A",[71,411,412],{"class":120},"---",[71,414,416],{"class":415},"s691h","\\n",[71,418,378],{"class":176},[71,420,381],{"class":81},[71,422,416],{"class":415},[71,424,412],{"class":120},[71,426,416],{"class":415},[71,428,406],{"class":120},[71,430,387],{"class":77},[71,432,433],{"class":73,"line":236},[71,434,436],{"emptyLinePlaceholder":435},true,"\n",[71,438,439,442,446],{"class":73,"line":281},[71,440,441],{"class":81},"def",[71,443,445],{"class":444},"s7eDp"," _blank_code",[71,447,448],{"class":77},"(text):\n",[71,450,451,454,457,460,463,466,469,472,475,477],{"class":73,"line":306},[71,452,453],{"class":81},"    return",[71,455,456],{"class":77}," fence_re.sub(",[71,458,459],{"class":81},"lambda",[71,461,462],{"class":77}," m: ",[71,464,465],{"class":120},"' '",[71,467,468],{"class":81}," *",[71,470,471],{"class":176}," len",[71,473,474],{"class":77},"(m.group(",[71,476,177],{"class":176},[71,478,479],{"class":77},")), text)\n",[28,481,483],{"className":65,"code":482,"language":67,"meta":37,"style":37},"    raw = 
f.read_text(encoding=\"utf-8\", errors=\"ignore\")\n    file_tokens.append(ntok(raw))\n    text = frontmatter_re.sub('', raw)\n    text_scan = _blank_code(text)   # code blocks blanked; positions preserved\n    idxs = [m.start() for m in header_re.finditer(text_scan)]\n    bounds = ([0] if (not idxs or idxs[0] > 0) else []) + idxs + [len(text)]\n    secs = [text[bounds[i]:bounds[i+1]] for i in range(len(bounds)-1)]\n    secs = [s for s in secs if s.strip()]\n    sections_per_file.append(len(secs))\n    sections += [(ntok(s), f, _heading(s)) for s in secs]\n",[35,484,485,510,515,530,543,560,610,644,664,672],{"__ignoreMap":37},[71,486,487,490,492,494,496,498,500,502,504,506,508],{"class":73,"line":74},[71,488,489],{"class":77},"    raw ",[71,491,82],{"class":81},[71,493,111],{"class":77},[71,495,115],{"class":114},[71,497,82],{"class":81},[71,499,121],{"class":120},[71,501,124],{"class":77},[71,503,127],{"class":114},[71,505,82],{"class":81},[71,507,132],{"class":120},[71,509,135],{"class":77},[71,511,512],{"class":73,"line":88},[71,513,514],{"class":77},"    file_tokens.append(ntok(raw))\n",[71,516,517,519,521,524,527],{"class":73,"line":103},[71,518,106],{"class":77},[71,520,82],{"class":81},[71,522,523],{"class":77}," frontmatter_re.sub(",[71,525,526],{"class":120},"''",[71,528,529],{"class":77},", raw)\n",[71,531,532,535,537,540],{"class":73,"line":138},[71,533,534],{"class":77},"    text_scan ",[71,536,82],{"class":81},[71,538,539],{"class":77}," _blank_code(text)   ",[71,541,542],{"class":348},"# code blocks blanked; positions preserved\n",[71,544,545,547,549,551,553,555,557],{"class":73,"line":144},[71,546,147],{"class":77},[71,548,82],{"class":81},[71,550,152],{"class":77},[71,552,91],{"class":81},[71,554,157],{"class":77},[71,556,97],{"class":81},[71,558,559],{"class":77}," 
header_re.finditer(text_scan)]\n",[71,561,562,564,566,568,570,572,574,576,578,580,582,584,586,588,590,592,594,596,598,600,602,604,606,608],{"class":73,"line":165},[71,563,168],{"class":77},[71,565,82],{"class":81},[71,567,173],{"class":77},[71,569,177],{"class":176},[71,571,180],{"class":77},[71,573,183],{"class":81},[71,575,186],{"class":77},[71,577,189],{"class":81},[71,579,192],{"class":77},[71,581,195],{"class":81},[71,583,198],{"class":77},[71,585,177],{"class":176},[71,587,180],{"class":77},[71,589,205],{"class":81},[71,591,208],{"class":176},[71,593,211],{"class":77},[71,595,214],{"class":81},[71,597,217],{"class":77},[71,599,220],{"class":81},[71,601,192],{"class":77},[71,603,220],{"class":81},[71,605,227],{"class":77},[71,607,230],{"class":176},[71,609,233],{"class":77},[71,611,612,614,616,618,620,622,624,626,628,630,632,634,636,638,640,642],{"class":73,"line":236},[71,613,239],{"class":77},[71,615,82],{"class":81},[71,617,244],{"class":77},[71,619,220],{"class":81},[71,621,249],{"class":176},[71,623,252],{"class":77},[71,625,91],{"class":81},[71,627,257],{"class":77},[71,629,97],{"class":81},[71,631,262],{"class":176},[71,633,265],{"class":77},[71,635,230],{"class":176},[71,637,270],{"class":77},[71,639,273],{"class":81},[71,641,249],{"class":176},[71,643,278],{"class":77},[71,645,646,648,650,652,654,656,658,660,662],{"class":73,"line":281},[71,647,239],{"class":77},[71,649,82],{"class":81},[71,651,288],{"class":77},[71,653,91],{"class":81},[71,655,293],{"class":77},[71,657,97],{"class":81},[71,659,298],{"class":77},[71,661,183],{"class":81},[71,663,303],{"class":77},[71,665,666,668,670],{"class":73,"line":306},[71,667,309],{"class":77},[71,669,230],{"class":176},[71,671,314],{"class":77},[71,673,674,676,678,681,683,685,687],{"class":73,"line":317},[71,675,320],{"class":77},[71,677,323],{"class":81},[71,679,680],{"class":77}," [(ntok(s), f, _heading(s)) 
",[71,682,91],{"class":81},[71,684,293],{"class":77},[71,686,97],{"class":81},[71,688,335],{"class":77},[10,690,691,692,694,695,698,699,702,703,706,707,709,710,712,713,716,717,720,721,724,725,728],{},"Three things happen here. Frontmatter is stripped from ",[35,693,33],{}," before chunking, but ",[35,696,697],{},"file_tokens"," is still counted from ",[35,700,701],{},"raw",", so file-level totals stay honest. Code blocks are blanked to spaces of equal length rather than deleted: the header regex runs on ",[35,704,705],{},"text_scan",", but each section's content is sliced from the original ",[35,708,33],{},", so a ",[35,711,47],{}," inside a code block can't match as a heading while the section's token count still includes the code it contains. And ",[35,714,715],{},"bounds"," uses ",[35,718,719],{},"len(text)",", not ",[35,722,723],{},"len(text_scan)",", which is safe only because ",[35,726,727],{},"_blank_code"," preserves character count.",[10,730,731],{},"The net effect was about 15 phantom sections disappearing and all frontmatter preamble chunks ceasing to exist.",[20,733,735],{"id":734},"the-manual-merge-work","The manual merge work",[10,737,738],{},"With accurate counts in hand, I went through the undersized sections. Three patterns drove most of the reduction.",[10,740,741,745,746,749,750,753,754,757],{},[742,743,744],"strong",{},"Cross-reference lists."," Around a hundred files carried a ",[35,747,748],{},"## Cross-references"," or ",[35,751,752],{},"## Related pages"," heading over nothing but a list of wiki links. Structurally these are sections; semantically they are navigation. 
Collapsing them to an inline ",[35,755,756],{},"See also:"," line dissolved 102 sections without losing a single link.",[10,759,760,763,764,124,767,770],{},[742,761,762],{},"Design-pattern facets."," Several pages had given every facet its own heading: ",[35,765,766],{},"When to use",[35,768,769],{},"When NOT to use",", framework equivalents, and cross-references, five to seven thin sections on one page that together make one solid entry. Merging the use-and-avoid headings and folding the equivalents back inline pulled roughly 14 sections into their neighbours.",[10,772,773,776,777,780,781,784],{},[742,774,775],{},"Bare titles."," Many pages went straight from ",[35,778,779],{},"# Title"," into ",[35,782,783],{},"## Core idea"," with nothing between, leaving the title section as pure overhead. A sentence or two of lead-in pushed those above the floor and gave each page an opening line.",[20,786,788],{"id":787},"results","Results",[790,791,792,808],"table",{},[793,794,795],"thead",{},[796,797,798,802,805],"tr",{},[799,800,801],"th",{},"Metric",[799,803,804],{},"Before",[799,806,807],{},"After",[809,810,811,823,834,845],"tbody",{},[796,812,813,817,820],{},[814,815,816],"td",{},"Total sections",[814,818,819],{},"954",[814,821,822],{},"713 (-25%)",[796,824,825,828,831],{},[814,826,827],{},"Undersized (\u003C 64 tok)",[814,829,830],{},"208 (21%)",[814,832,833],{},"111 (15%)",[796,835,836,839,842],{},[814,837,838],{},"Median section size",[814,840,841],{},"98",[814,843,844],{},"126",[796,846,847,850,853],{},[814,848,849],{},"Sections\u002Ffile (median)",[814,851,852],{},"9",[814,854,855],{},"6",[10,857,858],{},"A quarter of the sections were structure, not knowledge.",[20,860,862],{"id":861},"whats-left-and-why-im-not-fixing-it-in-markdown","What's left, and why I'm not fixing it in markdown",[10,864,865],{},"111 sections are still under the floor. 
They split into two kinds.",[10,867,868],{},"About 40 are entity index cards: short reference pages where every section is naturally thin because the whole page is thin, 30 to 50 lines of fact. Padding them dilutes the signal. They are small by design.",[10,870,871],{},"The other 70 or so are concept titles that landed just under the line, 38 to 63 tokens even after a lead-in. Forcing more text means restating the first paragraph, which helps nothing.",[10,873,874],{},"The clean fix lives in the chunker. The rule is simple: if a section is under the threshold and another section follows it in the same file, concatenate the two before embedding. That folds thin concept titles into the body below them and treats a small reference card as one chunk instead of five thin ones. It is a small change in one place and it decouples how I write from how the text embeds, which is where that decision belongs.",[10,876,877],{},"The alternative is to accept that the 64-token floor was always a rule of thumb. At a median of 126 tokens the corpus is healthy, and a 48-token section is a perfectly good retrieval unit. Drop the floor and a chunk of the residual stops being a problem.",[10,879,880],{},"Either way the lesson holds. A quarter of my section count was structure rather than knowledge, and the cleanest fixes were about what counts as a chunk, not how cleverly I cut one.",[20,882,884],{"id":883},"building-the-baseline","Building the baseline",[10,886,887],{},"Past the merge step, the corpus is finally homogeneous. All the surviving files are concept entries written to the same schema. 
The thesis: a clean single-domain index over these entries should retrieve well on its own, without routing or clustering.",[10,889,890],{},"The plan: build the dumbest possible baseline, measure recall, and only add complexity if the numbers demand it.",[20,892,894],{"id":893},"the-frozen-corpus","The frozen corpus",[10,896,897,898,901],{},"The wiki kept growing during the merge work, so I froze a snapshot as ",[35,899,900],{},"corpus-v1"," — a JSON manifest with SHA-256 hashes of every source file, treated as immutable for the experiment. Everything downstream points at this fixed snapshot.",[790,903,904,913],{},[793,905,906],{},[796,907,908,910],{},[799,909,801],{},[799,911,912],{},"Value",[809,914,915,923,931,939,947,955,963],{},[796,916,917,920],{},[814,918,919],{},"Source files",[814,921,922],{},"113",[796,924,925,928],{},[814,926,927],{},"Total chunks",[814,929,930],{},"673",[796,932,933,936],{},[814,934,935],{},"Median tokens per chunk",[814,937,938],{},"160",[796,940,941,944],{},[814,942,943],{},"p95 tokens",[814,945,946],{},"383",[796,948,949,952],{},[814,950,951],{},"Max tokens",[814,953,954],{},"568",[796,956,957,960],{},[814,958,959],{},"Oversized (>512 tokens)",[814,961,962],{},"6 (0%)",[796,964,965,968],{},[814,966,967],{},"Undersized (\u003C64 tokens)",[814,969,962],{},[10,971,972],{},"The undersized tail that drove the entire merge effort is gone. Six chunks on each side of the bounds is noise.",[20,974,976],{"id":975},"chunking-strategy","Chunking strategy",[10,978,979,980,983],{},"The chunker implements the merge-small-siblings rule from the previous section. 
Chunks split on ",[35,981,982],{},"##"," (H2) headers, then three passes clean up the edges:",[985,986,987,994,1000],"ol",{},[988,989,990,993],"li",{},[742,991,992],{},"Split"," on H2 boundaries — each section becomes a candidate chunk",[988,995,996,999],{},[742,997,998],{},"Merge"," consecutive small siblings (\u003C64 tokens each) until hitting a 300-token target",[988,1001,1002,1004],{},[742,1003,992],{}," oversized sections (>512 tokens) by paragraph",[10,1006,1007,1008,1011],{},"Every chunk gets a breadcrumb prefix prepended before embedding, so the model sees context like ",[35,1009,1010],{},"\"Tokenization Strategies › Part 2 — Document tokenization\""," at the start of every vector.",[10,1013,1014,1015,1018,1019,378],{},"Chunk IDs follow the pattern ",[35,1016,1017],{},"{relative_path}##{section_heading}",", giving stable anchors like ",[35,1020,1021],{},"concepts\u002FSageMaker.md##Training jobs",[28,1023,1025],{"className":65,"code":1024,"language":67,"meta":37,"style":37},"chunk_id = f\"{rel_path}##{heading}\"\nbreadcrumb = _breadcrumb(file_stem, h1, heading)\nfull_text = f\"{breadcrumb}\\n\\n{text.strip()}\"\n",[35,1026,1027,1061,1071],{"__ignoreMap":37},[71,1028,1029,1032,1034,1037,1040,1043,1046,1049,1051,1053,1056,1058],{"class":73,"line":74},[71,1030,1031],{"class":77},"chunk_id ",[71,1033,82],{"class":81},[71,1035,1036],{"class":81}," f",[71,1038,1039],{"class":120},"\"",[71,1041,1042],{"class":176},"{",[71,1044,1045],{"class":77},"rel_path",[71,1047,1048],{"class":176},"}",[71,1050,982],{"class":120},[71,1052,1042],{"class":176},[71,1054,1055],{"class":77},"heading",[71,1057,1048],{"class":176},[71,1059,1060],{"class":120},"\"\n",[71,1062,1063,1066,1068],{"class":73,"line":88},[71,1064,1065],{"class":77},"breadcrumb ",[71,1067,82],{"class":81},[71,1069,1070],{"class":77}," _breadcrumb(file_stem, h1, heading)\n",[71,1072,1073,1076,1078,1080,1082,1084,1087,1090,1093,1095],{"class":73,"line":103},[71,1074,1075],{"class":77},"full_text 
",[71,1077,82],{"class":81},[71,1079,1036],{"class":81},[71,1081,1039],{"class":120},[71,1083,1042],{"class":176},[71,1085,1086],{"class":77},"breadcrumb",[71,1088,1089],{"class":176},"}\\n\\n{",[71,1091,1092],{"class":77},"text.strip()",[71,1094,1048],{"class":176},[71,1096,1060],{"class":120},[20,1098,1100],{"id":1099},"the-pipeline","The pipeline",[10,1102,1103],{},"Seven steps from frozen corpus to measured recall. Each is one script.",[28,1105,1108],{"className":1106,"code":1107,"language":33},[31],"freeze_corpus.py → chunker.py → schema.sql → bench_embedders.py → pgvector → test_retrieval.py → test_e2e_quality.py\n",[35,1109,1107],{"__ignoreMap":37},[10,1111,1112,1115,1116,1119,1120,1123,1124,378],{},[742,1113,1114],{},"Freeze."," ",[35,1117,1118],{},"tools\u002Ffreeze_corpus.py"," collects all markdown files, computes SHA-256 hashes, writes ",[35,1121,1122],{},"corpus-v1.json",". The repo is tagged ",[35,1125,900],{},[10,1127,1128,1115,1131,1134,1135,1138],{},[742,1129,1130],{},"Chunk.",[35,1132,1133],{},"ingest\u002Fchunker.py"," reads the manifest, chunks every file, writes ",[35,1136,1137],{},"ingest\u002Fchunks.jsonl",". Each line is a JSON object with chunk text, breadcrumb, heading, source path, token count, and stable chunk ID.",[10,1140,1141,1115,1144,1147,1148,1151,1152,1155,1156,1159],{},[742,1142,1143],{},"Schema.",[35,1145,1146],{},"ingest\u002Fschema.sql"," creates the pgvector schema — a ",[35,1149,1150],{},"chunks"," table with a ",[35,1153,1154],{},"vector(768)"," embedding column, an HNSW index for cosine similarity, and a ",[35,1157,1158],{},"search_chunks()"," function. 
All statements are idempotent.",[28,1161,1165],{"className":1162,"code":1163,"language":1164,"meta":37,"style":37},"language-sql shiki shiki-themes github-light","CREATE TABLE IF NOT EXISTS chunks (\n    chunk_id    TEXT PRIMARY KEY,\n    text        TEXT NOT NULL,\n    breadcrumb  TEXT NOT NULL,\n    embedding   vector(768)\n);\n\nCREATE INDEX IF NOT EXISTS chunks_embedding_hnsw ON chunks\n    USING hnsw (embedding vector_cosine_ops)\n    WITH (m = 16, ef_construction = 64);\n","sql",[35,1166,1167,1187,1201,1214,1225,1240,1245,1249,1265,1273],{"__ignoreMap":37},[71,1168,1169,1172,1175,1178,1181,1184],{"class":73,"line":74},[71,1170,1171],{"class":81},"CREATE",[71,1173,1174],{"class":81}," TABLE",[71,1176,1177],{"class":444}," IF",[71,1179,1180],{"class":81}," NOT",[71,1182,1183],{"class":81}," EXISTS",[71,1185,1186],{"class":77}," chunks (\n",[71,1188,1189,1192,1195,1198],{"class":73,"line":88},[71,1190,1191],{"class":77},"    chunk_id    ",[71,1193,1194],{"class":81},"TEXT",[71,1196,1197],{"class":81}," PRIMARY KEY",[71,1199,1200],{"class":77},",\n",[71,1202,1203,1206,1209,1212],{"class":73,"line":103},[71,1204,1205],{"class":81},"    text",[71,1207,1208],{"class":81},"        TEXT",[71,1210,1211],{"class":81}," NOT NULL",[71,1213,1200],{"class":77},[71,1215,1216,1219,1221,1223],{"class":73,"line":138},[71,1217,1218],{"class":77},"    breadcrumb  ",[71,1220,1194],{"class":81},[71,1222,1211],{"class":81},[71,1224,1200],{"class":77},[71,1226,1227,1230,1233,1235,1238],{"class":73,"line":144},[71,1228,1229],{"class":77},"    embedding   ",[71,1231,1232],{"class":81},"vector",[71,1234,265],{"class":77},[71,1236,1237],{"class":176},"768",[71,1239,135],{"class":77},[71,1241,1242],{"class":73,"line":165},[71,1243,1244],{"class":77},");\n",[71,1246,1247],{"class":73,"line":236},[71,1248,436],{"emptyLinePlaceholder":435},[71,1250,1251,1253,1256,1259,1262],{"class":73,"line":281},[71,1252,1171],{"class":81},[71,1254,1255],{"class":81}," INDEX IF NOT EXISTS",[71,1257,1258],{"class":444}," 
chunks_embedding_hnsw",[71,1260,1261],{"class":81}," ON",[71,1263,1264],{"class":77}," chunks\n",[71,1266,1267,1270],{"class":73,"line":306},[71,1268,1269],{"class":81},"    USING",[71,1271,1272],{"class":77}," hnsw (embedding vector_cosine_ops)\n",[71,1274,1275,1278,1281,1283,1286,1289,1291,1294],{"class":73,"line":317},[71,1276,1277],{"class":81},"    WITH",[71,1279,1280],{"class":77}," (m ",[71,1282,82],{"class":81},[71,1284,1285],{"class":176}," 16",[71,1287,1288],{"class":77},", ef_construction ",[71,1290,82],{"class":81},[71,1292,1293],{"class":176}," 64",[71,1295,1244],{"class":77},[10,1297,1298,1115,1301,1304],{},[742,1299,1300],{},"Benchmark and load.",[35,1302,1303],{},"evals\u002Fbench_embedders.py"," runs all candidate models against an eval set of 25 hand-crafted question-to-chunk pairs. For each model it embeds all 673 chunks, embeds all 25 questions, computes recall@k, picks the winner by recall@5 then recall@10 then recall@1 as tiebreaker, and writes the winner's cached vectors straight to pgvector.",[28,1306,1308],{"className":65,"code":1307,"language":67,"meta":37,"style":37},"winner = max(results, key=lambda r: (\n    r[\"recall\"].get(\"@5\", 0),\n    r[\"recall\"].get(\"@10\", 0),\n    r[\"recall\"].get(\"@1\", 0),\n))\nwrite_winner_to_pgvector(winner, chunks, manifest, db_url)\n",[35,1309,1310,1332,1353,1370,1387,1392],{"__ignoreMap":37},[71,1311,1312,1315,1317,1320,1323,1326,1329],{"class":73,"line":74},[71,1313,1314],{"class":77},"winner ",[71,1316,82],{"class":81},[71,1318,1319],{"class":176}," max",[71,1321,1322],{"class":77},"(results, ",[71,1324,1325],{"class":114},"key",[71,1327,1328],{"class":81},"=lambda",[71,1330,1331],{"class":77}," r: (\n",[71,1333,1334,1337,1340,1343,1346,1348,1350],{"class":73,"line":88},[71,1335,1336],{"class":77},"    
r[",[71,1338,1339],{"class":120},"\"recall\"",[71,1341,1342],{"class":77},"].get(",[71,1344,1345],{"class":120},"\"@5\"",[71,1347,124],{"class":77},[71,1349,177],{"class":176},[71,1351,1352],{"class":77},"),\n",[71,1354,1355,1357,1359,1361,1364,1366,1368],{"class":73,"line":103},[71,1356,1336],{"class":77},[71,1358,1339],{"class":120},[71,1360,1342],{"class":77},[71,1362,1363],{"class":120},"\"@10\"",[71,1365,124],{"class":77},[71,1367,177],{"class":176},[71,1369,1352],{"class":77},[71,1371,1372,1374,1376,1378,1381,1383,1385],{"class":73,"line":138},[71,1373,1336],{"class":77},[71,1375,1339],{"class":120},[71,1377,1342],{"class":77},[71,1379,1380],{"class":120},"\"@1\"",[71,1382,124],{"class":77},[71,1384,177],{"class":176},[71,1386,1352],{"class":77},[71,1388,1389],{"class":73,"line":144},[71,1390,1391],{"class":77},"))\n",[71,1393,1394],{"class":73,"line":165},[71,1395,1396],{"class":77},"write_winner_to_pgvector(winner, chunks, manifest, db_url)\n",[10,1398,1399,1402,1403,1406,1407,1410],{},[742,1400,1401],{},"Measure."," Two scripts measure quality independently: ",[35,1404,1405],{},"evals\u002Ftest_retrieval.py"," for recall@k and MRR in isolation, and ",[35,1408,1409],{},"evals\u002Ftest_e2e_quality.py"," for end-to-end answer quality using an LLM judge.",[20,1412,1414],{"id":1413},"embedder-results","Embedder 
results",[790,1416,1417,1439],{},[793,1418,1419],{},[796,1420,1421,1424,1427,1430,1433,1436],{},[799,1422,1423],{},"Model",[799,1425,1426],{},"dim",[799,1428,1429],{},"r@1",[799,1431,1432],{},"r@3",[799,1434,1435],{},"r@5",[799,1437,1438],{},"r@10",[809,1440,1441,1460,1476],{},[796,1442,1443,1446,1449,1452,1455,1458],{},[814,1444,1445],{},"BAAI\u002Fbge-small-en-v1.5",[814,1447,1448],{},"384",[814,1450,1451],{},"0.760",[814,1453,1454],{},"0.960",[814,1456,1457],{},"1.000",[814,1459,1457],{},[796,1461,1462,1465,1467,1470,1472,1474],{},[814,1463,1464],{},"BAAI\u002Fbge-base-en-v1.5",[814,1466,1237],{},[814,1468,1469],{},"0.800",[814,1471,1454],{},[814,1473,1457],{},[814,1475,1457],{},[796,1477,1478,1483,1487,1492,1496,1500],{},[814,1479,1480],{},[742,1481,1482],{},"nomic-ai\u002Fnomic-embed-text-v1.5",[814,1484,1485],{},[742,1486,1237],{},[814,1488,1489],{},[742,1490,1491],{},"0.880",[814,1493,1494],{},[742,1495,1454],{},[814,1497,1498],{},[742,1499,1457],{},[814,1501,1502],{},[742,1503,1457],{},[10,1505,1506],{},"All three models hit recall@5 = 1.000 — every eval question found its correct chunk in the top 5 results. The differentiation is entirely at r@1: how often the single top result is the right one.",[10,1508,1509,186,1512,1514],{},[742,1510,1511],{},"Winner: Nomic",[35,1513,1482],{},"), with recall@1 = 0.880 — 22 out of 25 questions had the correct chunk ranked first.",[20,1516,1518],{"id":1517},"the-verdict","The verdict",[10,1520,1521],{},"The thesis held. A clean single-domain index retrieves perfectly at recall@5 without domain-clustering, reranking, or any of the complexity I assumed I'd need. 
The clustering idea was a problem I scoped away rather than one I had to engineer around.",[10,1523,1524],{},"The full run:",[28,1526,1530],{"className":1527,"code":1528,"language":1529,"meta":37,"style":37},"language-bash shiki shiki-themes github-light","docker compose up -d\npython3 tools\u002Ffreeze_corpus.py wiki\u002F --tag corpus-v1\npython3 ingest\u002Fchunker.py\npsql \"$DATABASE_URL\" -f ingest\u002Fschema.sql\npython3 evals\u002Fbench_embedders.py --write-to-db\npython3 evals\u002Ftest_retrieval.py --mode pgvector\npython3 evals\u002Ftest_e2e_quality.py --retrieval-mode pgvector\n","bash",[35,1531,1532,1546,1563,1570,1589,1599,1612],{"__ignoreMap":37},[71,1533,1534,1537,1540,1543],{"class":73,"line":74},[71,1535,1536],{"class":444},"docker",[71,1538,1539],{"class":120}," compose",[71,1541,1542],{"class":120}," up",[71,1544,1545],{"class":176}," -d\n",[71,1547,1548,1551,1554,1557,1560],{"class":73,"line":88},[71,1549,1550],{"class":444},"python3",[71,1552,1553],{"class":120}," tools\u002Ffreeze_corpus.py",[71,1555,1556],{"class":120}," wiki\u002F",[71,1558,1559],{"class":176}," --tag",[71,1561,1562],{"class":120}," corpus-v1\n",[71,1564,1565,1567],{"class":73,"line":103},[71,1566,1550],{"class":444},[71,1568,1569],{"class":120}," ingest\u002Fchunker.py\n",[71,1571,1572,1575,1578,1581,1583,1586],{"class":73,"line":138},[71,1573,1574],{"class":444},"psql",[71,1576,1577],{"class":120}," \"",[71,1579,1580],{"class":77},"$DATABASE_URL",[71,1582,1039],{"class":120},[71,1584,1585],{"class":176}," -f",[71,1587,1588],{"class":120}," ingest\u002Fschema.sql\n",[71,1590,1591,1593,1596],{"class":73,"line":144},[71,1592,1550],{"class":444},[71,1594,1595],{"class":120}," evals\u002Fbench_embedders.py",[71,1597,1598],{"class":176}," --write-to-db\n",[71,1600,1601,1603,1606,1609],{"class":73,"line":165},[71,1602,1550],{"class":444},[71,1604,1605],{"class":120}," evals\u002Ftest_retrieval.py",[71,1607,1608],{"class":176}," --mode",[71,1610,1611],{"class":120}," 
pgvector\n",[71,1613,1614,1616,1619,1622],{"class":73,"line":236},[71,1615,1550],{"class":444},[71,1617,1618],{"class":120}," evals\u002Ftest_e2e_quality.py",[71,1620,1621],{"class":176}," --retrieval-mode",[71,1623,1611],{"class":120},[20,1625,1627],{"id":1626},"whats-next","What's next",[10,1629,1630],{},"With only 25 eval questions, the gap between r@1 = 0.760 and r@1 = 0.880 is literally 3 questions. The natural next step is expanding the eval set from 25 to 500+ using LLM-generated drafts, measuring end-to-end answer quality with the LLM judge, and logging the final numbers as a decision record. If recall holds on the larger eval set, the baseline becomes production.",[1632,1633,1634],"style",{},"html pre.shiki code .sgsFI, html code.shiki .sgsFI{--shiki-default:#24292E}html pre.shiki code .sD7c4, html code.shiki .sD7c4{--shiki-default:#D73A49}html pre.shiki code .sqxcx, html code.shiki .sqxcx{--shiki-default:#E36209}html pre.shiki code .sYBdl, html code.shiki .sYBdl{--shiki-default:#032F62}html pre.shiki code .sYu0t, html code.shiki .sYu0t{--shiki-default:#005CC5}html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .sAwPA, html code.shiki .sAwPA{--shiki-default:#6A737D}html pre.shiki code .s691h, html code.shiki .s691h{--shiki-default:#22863A;--shiki-default-font-weight:bold}html pre.shiki code .s7eDp, html code.shiki 
.s7eDp{--shiki-default:#6F42C1}",{"title":37,"searchDepth":88,"depth":88,"links":1636},[1637,1638,1639,1640,1641,1642,1643,1644,1645,1646,1647,1648],{"id":22,"depth":88,"text":23},{"id":40,"depth":88,"text":41},{"id":734,"depth":88,"text":735},{"id":787,"depth":88,"text":788},{"id":861,"depth":88,"text":862},{"id":883,"depth":88,"text":884},{"id":893,"depth":88,"text":894},{"id":975,"depth":88,"text":976},{"id":1099,"depth":88,"text":1100},{"id":1413,"depth":88,"text":1414},{"id":1517,"depth":88,"text":1518},{"id":1626,"depth":88,"text":1627},null,"2026-06-09","A 25% drop in section count, a measurement bug, a retrieval pipeline that didn't need the clever parts, and recall@5 = 1.000 on the first try.",false,"md",{},"\u002Fblog\u002Fmerging-my-wiki",{"title":5,"description":1651},"blog\u002Fmerging-my-wiki",[1659,1660,1661],"rag","embeddings","chunking","yQR9No1Mt66F93rUDsX49e04WNclBfoUDJDR13nLddU",1781020791590]