What's between Kue and Kuh? Finding OCR errors with no OCR

A case study from rebuilding a 1940s Kimbundu→Portuguese dictionary. The punchline: I found and fixed 203 systematic scanning errors across ~10,800 entries using zero machine learning for detection — just the fact that a dictionary is sorted.

The setup

I'm reconstructing a 1940s Kimbundu→Portuguese dictionary from column scans. The pipeline reads each scanned column with a vision model and emits structured JSON — one object per headword, with senses, grammar, noun classes, the lot. ~10,800 entries when stitched together.

Vision models are good at this, but the source fights back. It's letterpress from the 1940s: ink bleed, broken serifs, a typeface where several tall letters look almost identical when the crossbar didn't take. So the output has a long tail of glyph errors — a headword read as Kafanga when the page actually says Katanga. One wrong letter, buried in one of ten thousand entries.

How do you find those at scale? You can't re-read the whole book by eye, and you can't ask the same model that made the mistake to grade its own work. I needed an oracle that was independent of the model.

The dictionary itself turned out to be that oracle.

The seed: one impossible word

It started with a single flagged entry. On page 370, a headword had been read as Ûjúsa, sitting in a run of Uj- words — wedged between the Ue- entries above and the Uh- entries below.

That's impossible. In a sorted dictionary, Uj- comes after Uh-, never before it. A word sitting between Ue and Uh cannot start with Uj — it can only be Uf- or Ug-. So whatever glyph the model read as j was really an f. The whole run was Uf- words; the bold fs had been misread as js.

The fix was forced by position alone. I never needed to look at the scan to know the letter was wrong — and the neighbours even told me what it should be.

That's the whole idea. Let me state it generally:

The core idea: an entry's neighbours pin its slot; a spelling that can't occupy that slot is a located, self-correcting error.

An entry that violates sort order is a located error, and its neighbours tell you the correction. No model, no dictionary of valid words, no training data — just the invariant "the book is sorted."

Building a glyph-agnostic detector

The nice thing about phrasing it as sort-order violation is that it doesn't care which letter broke. f-read-as-j, k-read-as-x, o-read-as-a — they all surface the same way: an entry that sorts where it can't.

First I needed a collation key matching the dictionary's own ordering (Kimbundu sorts prenasalised consonants by their base letter — Nd→d, Ng→g — and ignores diacritics):

def ck(lemma):
    s = unicodedata.normalize("NFD", lemma).encode("ascii", "ignore").decode().lower()
    s = re.sub(r"[^a-z]", "", s)
    s = re.sub(r"^m(?=[bpv])", "", s)        # Mb/Mp/Mv → b/p/v
    s = re.sub(r"^n(?=[dgjzkt])", "", s)     # Nd/Ng/Nj… → d/g/j…
    return s

Then: flatten every headword in reading order, compute its key, and find the entries that break monotonicity. My first attempt picked one longest non-decreasing subsequence (LNDS) and flagged everything off it. That was a mistake — there are many equally-long subsequences, and the reconstruction made unstable choices, blaming whole blocks of perfectly-good words because one later neighbour happened to be out of place. 873 flags, most of them noise.

The fix is to make the test canonical instead of picking one arbitrary backbone. An entry lies on some longest subsequence iff f[i] + g[i] - 1 == L, where f[i] is the longest non-decreasing run ending at i, g[i] the longest starting at i, and L the global max. An entry is a true anomaly only if it can sit on no valid ordering:

#  f[i] = longest non-decreasing run of keys ENDING at i
#  g[i] = longest non-decreasing run STARTING at i
#  i is an anomaly  ⟺  f[i] + g[i] - 1 < L      (it fits on no longest spine)
L = max(f)
anomalies = [i for i in range(n) if f[i] + g[i] - 1 < L]

That one change dropped the noise from 873 to a focused set and stopped it from blaming the wrong side of a swap. Still zero model calls — it's pure combinatorics over the sort key.

From noise to signal

Even canonical, the detector flags more than just glyph errors, and being honest about what each flag is mattered as much as finding them. Three buckets:

Real glyph corruptions — Kafanga for Katanga. The prize.
"Blamed legit blocks" — a single real word floats out of order (a loanword, or the 1940s editor's own slip), and the detector flags the in-order block it landed among. Jinguba ("peanut") is a perfectly good word; it just sorts oddly.
Collation edge cases — a prenasalised form my key didn't fold, the source's own imperfect ordering, addenda pages that aren't alphabetical at all.

The tell for a real glyph error is that it implies a single believable letter-swap deep inside an otherwise-sorted run. So for each flagged entry I substituted each candidate letter and re-tested the whole word against its neighbour bracket. Usually exactly one letter makes it fit — and that's the correction, derived, not guessed:

# Kukofama, wedged in the Kukos…Kukot bracket. Try each letter, keep what fits:
#   s → kukosama   sorts BEFORE kukoso   ✗
#   t → kukotama   sorts inside bracket  ✓   → unique answer: Kukotama
valid = [c for c in FAMILY if low <= key[:pos] + c + key[pos+1:] <= high]
if len(valid) == 1:
    correction = substitute(lemma, pos, valid[0])

Group the survivors by which swap they imply, and a pattern jumps out: dozens of entries, all over the book, all implying the same f→t or f→j. That's not coincidence. That's a typeface defect.

Verifying without re-reading the book

Logic says the letter is wrong; it can't by itself tell a glyph misread (relabel it) from a genuinely misplaced word (leave it, move it later). For that you need the scan — but only a sample, and the page hands you two beautiful free oracles.

The running head. Every page prints its section in the top margin. The page below screams RIK — so every headword on it is Rik-, full stop. The entry my model called Ríxala is Ríkala; the k was read as x.

Page running head RIK confirms the whole column is Rik-; the entry read as Ríxala is Ríkala, and the cross-reference V. pl. mákala carries the k.

The noun-class plural. Bantu languages mark plurals with a class prefix, and the dictionary cross-references them. Ritéle cross-refs V. pl. matelu; Ritatamena cross-refs V. pl. matatamena. The plural carries the same root consonant — so the cross-reference confirms the letter independently. The entry effectively checks its own spelling.

The Katanga case: f→t confirmed both by position (between Kátandu and Katangu) and by the entry's own cross-reference V. kitangana.

A dozen scan checks across the book were enough to confirm the patterns held, then the deterministic resolver applied the rest.

The twist: there were two

Because the detector is glyph-agnostic, I didn't have to guess which confusions existed — I just read off the substitutions it produced. The dominant one was the long-stem family: a bold f with a faint crossbar read as j, t, or s.

But a second, completely separate family fell out of the same run: k read as x. In this typeface a worn k's diagonal limbs collapse into an x. Mukóze, Makunde, Mikondo, Okoto, Xikita — all k words, all read as x. The page below is the Muk- section; every "x" in it is really a k:

The Muk- section: Mukosa, Mukóze, Mukoto, Mukoue — all k, every one read as x by the model.

One of these, Muxirikiri, had earlier been written off by a different heuristic as a "misplaced entry." It was never misplaced — it was Mukirikiri the whole time. The sort-order lens reclassified a phantom into a real, fixable error.

A second family: a bold f read as j (Mujúndi → Mufúndi), no descender or dot actually present in the print.

Discipline: the false positive, and the honesty

The method has a sharp edge: if a flagged word is genuinely a real word that merely sorts oddly, the deterministic fix will happily "correct" something that was right. It bit me once. Kunjuzá — a Portuguese loanword ("confusão") — got rewritten to Kunfuzá before I caught it. The fix: screen singletons for loanword markers (port., lat.…) and trust runs (several consecutive entries with the identical swap — a systematic band misread) more than lone entries. One revert, one screen, and the edge was contained.

And the part it's tempting to skip: most flags are not errors. Of ~495 remaining, only ~80 are a genuine glyph tail (multi-letter corruptions, s-vs-t ambiguity — these need per-entry vision). The other ~415 are benign: blamed-legit-blocks, the source's own ordering slips, a prenasal my key doesn't fold. Reporting "495 problems" would have been a lie; the honest number of systematic corruptions was two families and ~290 entries, and I fixed what I could prove.

Results

Glyph corrections applied & verified	203 (+91 from the seed runs)
Distinct typeface confusion families found	2 — f·t·j and k·x
Sort-order violation flags	696 → 495
Adjacent out-of-order headwords	222 → 161
ML tokens spent on detection	0
False positives caught & reverted	1 (Kunjuzá)

Every correction is logged as a reversible ocr_correction with its basis, so nothing is a black box and any change can be undone. The entry carries its own provenance — here the sort-order fix sits on top of an earlier raw-OCR cleanup, each step recorded:

{
  "lemma": "Katanga",
  "ocr_corrections": [
    { "from": "paninhc", "to": "paninho" },
    { "from": "Kafanga", "to": "Katanga", "basis": "alpha-impossibility f->t" }
  ]
}

The transferable lesson

The reason this worked has nothing to do with dictionaries. It's that the data had a cheap, model-independent invariant — it's sorted — and an invariant is a free test oracle. When you can state a property your data must hold, every violation is a located bug, and often the property tells you the fix, deterministically, for nothing.

Most datasets have one of these hiding in them:

a ledger whose columns must sum to the total,
events whose timestamps must be monotonic,
a tree whose children must reference a real parent,
percentages that must hit 100, IDs that must be unique, foreign keys that must resolve.

Before reaching for a model to find errors in structured data, ask what the data already promises about itself. The invariant is cheaper than the model, it's deterministic, it's explainable — and it doesn't grade its own homework.

This is a companion deep-dive to Building a Digital Kimbundu Dictionary in the Age of AI, which covers the full corpus-engineering pipeline.

Appendix: the tools

All deterministic, no model calls until the scan-verification sample (scripts/rebuild/):

detect_alpha_impossibilities.py — the f[i]+g[i]-1 < L detector. The workhorse.
resolve_family.py <letters> — substitute-and-re-test for any glyph pair; emits unique corrections.
apply_family.py <fam> — applies them, screening lone entries for loanword false positives.
detect_section_disorder.py — a coarser column-level cousin (LNDS over per-column median keys) for catching whole misfiled columns.