Kimbundu.org
I transformed a historical scanned Kimbundu–Portuguese dictionary into a structured lexical corpus through OCR, deterministic parsing, reconstruction, and AI-assisted editorial workflows. That corpus now powers kimbundu.org.
Why this project exists
Kimbundu is one of the major Bantu languages of Angola, spoken by millions of people. Despite that, it's barely documented in digital form. The best lexical references exist as scanned books or old PDFs that no search engine can read and no language tool can use.
I wanted to take one of the most complete surviving Kimbundu–Portuguese dictionaries, digitise it properly, and publish it as a structured, searchable public resource.
This is also personal. I’m of Angolan heritage, and Kimbundu matters to my family and community. As large language models increasingly shape how people find knowledge, languages that are not digitised risk becoming invisible in the systems that will define future access to culture and history.
The problem
The source material is a historical printed dictionary, hundreds of scanned pages in a dense two-column layout. Getting clean structured data out of it is hard:
- Two-column page layouts that need accurate segmentation before OCR can even start
- OCR noise from aged print, uneven scanning, and non-Latin diacritics
- Dense abbreviations, noun-class markers, and grammatical annotations crammed into compact entries
- Line, column, and page boundary issues that break naive text extraction
- Every entry needs to carry provenance back to its source page and position
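The column problem gives a flavour of the work. The write-up doesn't include the actual segmentation code, but as a minimal sketch of one common approach (a vertical projection profile, assumed here for illustration), a two-column scan can be split by finding the whitespace gutter near the middle of the page:

```python
import numpy as np

def find_gutter(page: np.ndarray, band: float = 0.2) -> int:
    """Locate the column gutter of a two-column page scan.

    page: 2-D array of ink density (0 = white, 1 = black).
    Searches the central `band` of the page width for the
    x-position with the least ink, i.e. the whitespace gutter.
    """
    ink_per_column = page.sum(axis=0)
    w = page.shape[1]
    lo = int(w * (0.5 - band / 2))
    hi = int(w * (0.5 + band / 2))
    return lo + int(np.argmin(ink_per_column[lo:hi]))

def split_columns(page: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a page image into left and right columns at the gutter."""
    g = find_gutter(page)
    return page[:, :g], page[:, g:]
```

On clean scans this is often enough; skewed or warped pages need deskewing first, which is one reason each stage's output had to stay inspectable.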
The approach
I designed the pipeline as a series of explicit, auditable stages. Each step produces inspectable outputs, which made it possible to debug OCR issues, reconstruction errors, and editorial decisions without hiding uncertainty inside a single opaque process.
1. Historical PDF
2. Page extraction
3. Column segmentation
4. OCR capture
5. Deterministic parsing
6. Corpus reconstruction
7. Cleanup
8. Conservative LLM audit
9. Editorial merge
10. Public website dataset
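The "explicit, auditable stages" idea can be sketched as a runner that persists every stage's output before the next stage consumes it. The stage names, record shapes, and parsing rule below are hypothetical, not the project's actual code:

```python
import json
from pathlib import Path

def run_stage(name: str, fn, data, out_dir: Path):
    """Run one pipeline stage and persist its output as JSON,
    so intermediate results can be inspected and debugged."""
    result = fn(data)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.json"
    out_path.write_text(json.dumps(result, ensure_ascii=False, indent=2))
    return result

def parse_entries(lines):
    """Illustrative deterministic parse: headword before the first
    comma, remainder as the gloss, provenance carried through."""
    entries = []
    for rec in lines:
        head, _, gloss = rec["text"].partition(",")
        entries.append({
            "headword": head.strip(),
            "gloss": gloss.strip(),
            "source": {"page": rec["page"], "line": rec["line"]},
        })
    return entries
```

Because every stage writes a file, a bad entry on the website can be traced backwards stage by stage until the step that introduced the error is found.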
What I built
The pipeline produced:
- A structured lexical corpus with headwords, grammatical metadata, source-derived definitions, cross-references, and provenance per entry
- Full provenance tracking, so every entry traces back to its original page, column, and line
- A conservative AI auditing layer (Ollama and OpenAI) that proposes corrections without overwriting source data, with all suggestions tracked and reviewable
- An editorial merge workflow that reconciles deterministic parsing with LLM suggestions under human review
- The public dictionary dataset behind kimbundu.org
- A Next.js site for searching, browsing, and exploring the full corpus
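The key design constraint was that the LLM never writes to source data directly. As an illustration of that shape (the field names and schema here are invented for the example, not the site's actual data model), proposals are stored alongside each entry and applied only when a human editor accepts them:

```python
from dataclasses import dataclass, field

@dataclass
class Provenance:
    page: int
    column: int
    line: int

@dataclass
class Entry:
    headword: str
    definition: str
    provenance: Provenance
    # Audit proposals accumulate here; they are never auto-applied.
    suggestions: list = field(default_factory=list)

def propose(entry: Entry, field_name: str, value: str, model: str) -> None:
    """Record an LLM correction as a reviewable suggestion.
    The entry's source-derived fields stay untouched."""
    entry.suggestions.append({
        "field": field_name,
        "proposed": value,
        "model": model,
        "status": "pending",
    })

def accept(entry: Entry, index: int) -> None:
    """Editorial merge: a human explicitly accepts one suggestion,
    which is only then applied to the entry."""
    s = entry.suggestions[index]
    setattr(entry, s["field"], s["proposed"])
    s["status"] = "accepted"
```

Keeping suggestions as data rather than edits means every change remains attributable: the corpus records what the source said, what the model proposed, and what an editor decided.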
Why it matters
This isn't just engineering for the sake of it. Languages that haven't been digitised don't show up in the tools people actually use to find information. AI trains on what's already digital. What isn't there won't be reflected.
High-quality digital cultural libraries are infrastructure. They give learners, researchers, and future language tools something reliable to build on, rather than forcing underdocumented languages to remain absent from the systems that increasingly mediate knowledge.
Kimbundu.org is a small but real contribution to that.
What comes next
There's more I want to do with this:
- A modern Portuguese layer, showing contemporary translations alongside the original archaic ones
- English and French translations to make the dictionary more widely accessible
- Grammar resources: verb tables, noun-class guides, usage patterns
- Cultural content like stories, songs, proverbs, and Bible texts in Kimbundu
- Eventually, a broader digital cultural library that goes beyond just the dictionary