The project started with a simple question
What happens to a language when its best reference materials only exist as scanned books?
That question has been sitting with me for a long time.
Kimbundu is one of the major Bantu languages of Angola. It is not a small language, and it is not culturally insignificant. But if you go looking for serious digital resources, the landscape gets thin fast. The most substantial references live in old printed works, scanned PDFs, and fragments of knowledge that are difficult to search, hard to reuse, and invisible to modern software.
One of those references is a historical Kimbundu–Portuguese dictionary. I decided to take it and turn it into something machines and people could both use: a structured, searchable lexical corpus that could power a public dictionary website and, over time, a wider digital cultural library.
That is how kimbundu.org began.
This was never really "just an OCR project"
At the beginning, it is tempting to think a project like this is mostly about OCR. It is not. OCR is the first capture layer. The real problem is structure.
Historical dictionaries are dense, compact, and full of compressed meaning. This source material is printed in a two-column layout, packed with abbreviations, noun-class markers, cross-references, grammatical notes, and weak separators that are obvious to a human reader but ambiguous to a machine. The scan itself introduces its own problems: dust, diacritics, broken ligatures, header bleed, line-wrap artefacts, and column-boundary confusion.
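The column-boundary problem above has a classic, pre-LLM answer: a vertical projection profile. The article does not show the project's actual segmentation code, so the following is only a minimal sketch of the idea, assuming a binarized page image where ink is 1 and background is 0.

```python
import numpy as np

def find_column_split(page: np.ndarray, margin: float = 0.25) -> int:
    """Locate the gutter between two text columns on a binarized page.

    page: 2D array, 0 = background, 1 = ink.
    Sums ink down each pixel column (a vertical projection profile),
    then picks the whitest x position in the central band of the page.
    The margin keeps the search away from the outer page edges.
    """
    h, w = page.shape
    ink_per_column = page.sum(axis=0)        # vertical projection profile
    lo, hi = int(w * margin), int(w * (1 - margin))
    band = ink_per_column[lo:hi]
    return lo + int(np.argmin(band))         # x index of the least-inked stripe
```

On clean scans this is enough; the dust, header bleed, and skew described above are exactly what pushes a real pipeline toward more defensive variants of the same idea.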
A direct "PDF to text" workflow would have produced something that looked complete while quietly hiding structural errors everywhere.
So I made an early decision that shaped the whole project:
Do not treat the dictionary as a single OCR task. Treat it as a staged corpus-engineering problem.
The pipeline I built
The workflow eventually became a twelve-stage pipeline: from PDF acquisition and page rendering, through column segmentation and OCR capture, into deterministic parsing, chunked extraction, and corpus consolidation, then reconstruction across line, column, page, and chunk boundaries, cleanup into a Portuguese-first lexical dataset, a conservative LLM audit, editorial merge, and finally a publication layer for the website.
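A staged pipeline like the one described above can be driven by a very small harness, as long as every stage persists its output. This is an illustrative sketch only: the stage names paraphrase the article, and the placeholder transform and file layout are assumptions, not the project's actual code.

```python
import json
from pathlib import Path

# Illustrative stage order, paraphrased from the description above.
STAGES = [
    "acquire_pdf", "render_pages", "segment_columns", "ocr_capture",
    "parse_deterministic", "extract_chunks", "consolidate_corpus",
    "reconstruct_entries", "clean_lexicon", "llm_audit",
    "editorial_merge", "publish_site_bundle",
]

def run_pipeline(data, out_dir: Path):
    """Run each stage in order, writing its output to disk so every
    intermediate artefact stays inspectable after the run."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, stage in enumerate(STAGES, start=1):
        data = {"stage": stage, "input": data}   # placeholder transform
        (out_dir / f"{i:02d}_{stage}.json").write_text(json.dumps(data))
    return data
```

The point of the sketch is the shape, not the transforms: numbered on-disk artefacts are what make a twelve-stage workflow debuggable instead of opaque.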
That architecture now lives explicitly in the project documentation, rather than as tribal knowledge in scripts and terminal history.
What mattered most was that every stage stayed inspectable. Raw OCR is preserved. Page and column provenance are retained. Cleanup is separated from semantic enrichment. Audit output is advisory, never destructive. Editorial changes are batched and merged explicitly.
The corpus is digital, and it is traceable.
Why deterministic parsing mattered
Large language models are powerful, but they are not a substitute for structure.
The core parsing and reconstruction pipeline had to be deterministic. I did not want the source of truth to shift every time I reran a script. The system needed to produce stable outputs, preserve evidence, and make it obvious where uncertainty remained.
So the main workflow stayed deliberately conservative:
- raw OCR stayed raw
- parsing extracted only what could be supported structurally
- reconstruction repaired fused entries before semantic cleanup
- cleanup produced a Portuguese-first lexical dataset
- LLMs were used later as auditors, not silent editors
The most useful role for the model was not "rewrite the dictionary for me." It was:
"Flag suspicious entries, classify likely issues, and suggest where human attention is warranted."
That distinction kept the project honest.
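The advisory stance can be made concrete in a few lines. Assume the model has already produced flags keyed by headword (the function and field names here are illustrative, not the project's actual API): the audit step only ever emits a report for human review, and never rewrites the corpus itself.

```python
def apply_audit(entries, model_flags):
    """Attach advisory audit findings without touching entry text.

    entries: list of dicts with at least a "headword" key.
    model_flags: {headword: [issue, ...]} as produced by a reviewing
    model. Returns a report; the corpus is left untouched.
    """
    report = []
    for entry in entries:
        for issue in model_flags.get(entry["headword"], []):
            report.append({
                "headword": entry["headword"],
                "issue": issue,
                "action": "needs human review",   # never auto-applied
            })
    return report
```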
What the system actually produced
The end state is a layered corpus with multiple explicit outputs: a cleaned corpus, an audited corpus, an editorial working corpus, a final merged version, a slim public corpus, and a site publication bundle.
After editorial recovery, the final merged corpus contains 10,679 entries, including 376 approved editorial resolutions, 58 entries recovered during editorial work, and 28 dropped false or redundant fragments.
On the website side, the current application serves that public corpus in a dictionary-first interface with landing/search, word pages, alphabetical browse, noun-class pages, and multilingual routing. The project is already a usable public resource, not just a research artefact.
The hardest part was structural recovery
Historical dictionaries do not stop cleanly at machine-friendly boundaries. Entries bleed into each other across line wraps. A page heading can leak into lexical content. A cross-reference can look like a new definition. A broken scan can make one entry look like three, or three look like one.
Some of the most useful work in the pipeline involved reconstructing entries across line, column, page, and chunk boundaries. This is where the project stopped being "document parsing" and became something closer to historical corpus reconstruction.
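At its simplest, boundary reconstruction is a fold over lines: anything that does not open a new entry is a continuation of the previous one. The headword pattern below is a deliberately naive stand-in (the real dictionary's entry markers are far messier), but it shows the shape of the repair.

```python
import re

# Naive assumption for illustration: a new entry starts with a
# lowercase (possibly accented) headword followed by a comma.
HEADWORD = re.compile(r"^[a-zà-ž'-]+,")

def reconstruct(lines):
    """Re-join entries split across line, column, or page boundaries.

    A line that does not open a new headword is fused onto the
    previous entry instead of becoming a spurious entry of its own.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if HEADWORD.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line   # continuation of previous entry
    return entries
```

The hard part in practice is the ambiguity named above: cross-references and header bleed can satisfy almost any headword heuristic, which is why the real pipeline keeps this stage separate and inspectable.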
I now think of the project less as a dictionary site and more as the foundation of a digital cultural archive.
Why this matters in the age of LMs
There is a broader reason I care about this work.
We are entering a period where more and more people will not search the web in the traditional sense. They will ask language models. They will ask assistants. They will rely on generated summaries, synthetic answers, and systems that compress the world into a handful of probable responses.
Those systems will only know what has been made digitally available to them.
If a language is poorly digitised, its representation in future AI systems will be thin, distorted, or absent. If a culture's best sources remain trapped in scans, inaccessible archives, or hard-to-parse books, future systems will reflect that absence. The problem is not technical. It is civilisational.
A high-quality digital cultural library is infrastructure. It gives learners something trustworthy to build on, researchers something structured to analyse, future tools something real to retrieve, and a community something that preserves voice, memory, and meaning in machine-readable form.
I do not want my children, twenty years from now, to inherit a future where their access to Angolan culture is mediated mainly through whatever random scraps Silicon Valley happened to ingest. I want them to have access to real archives, real voices, and materials built with care and proximity to the people and histories involved.
Why kimbundu.org is only the beginning
The dictionary matters, but it is not the endpoint. The broader goal is a genuine digital Kimbundu cultural library.
That means, over time:
- a modernised Portuguese display layer alongside the historical source text
- English and French translation layers
- grammar resources and noun-class guides
- Kimbundu Bible texts and audio
- stories, proverbs, songs, and oral materials
- linked references between dictionary entries and broader learning resources
The website already reflects that direction: the app is intentionally dictionary-first, but clearly positioned as the beginning of a wider language-preservation platform. That pacing is deliberate. A strong archive should be built on a stable foundation.
What I learned building it
Good intermediate artefacts are worth the effort
Debug images, page-level JSON, chunk summaries, corpus reports, audit summaries, editorial manifests: all of it felt like overhead at first. It wasn't. It was the reason the project remained debuggable as complexity grew.
Reviewability is a feature
When you work with historical material, being able to explain why a transformation happened matters almost as much as the transformation itself.
LLMs are best used carefully
The most valuable use of language models here was constrained issue detection and conservative audit suggestion, not freeform rewriting. That boundary protected the corpus from becoming a moving target.
Structure comes before enrichment
It is tempting to jump into translation, glossing, or more visible product features. But once the underlying structure is unstable, everything on top of it becomes expensive to trust.
Where I want to take it next
The next phase is about making the corpus more useful without losing fidelity:
- modern Portuguese display forms alongside the original dictionary Portuguese
- future Kimbundu standardisation support as orthographic guidance becomes clearer
- better educational presentation of noun classes and grammar
- deeper links between lexical entries and texts, songs, and stories
- eventually, richer language tools built on top of a stable, inspectable archive
The dictionary was the first hard problem. It will not be the last.
Final thought
The most important result of this project is not that a historical dictionary became searchable. It is that the path from scan to structured cultural resource is now explicit, auditable, and reusable.
Preservation is not just about keeping artefacts alive. It is about making them legible to the systems that shape future access to knowledge. And increasingly, those systems are no longer shelves or search engines. They are models.
If we want languages like Kimbundu to remain visible in that future, we need to build the libraries now.