Kimbundu.org
I transformed a historical scanned Kimbundu–Portuguese dictionary into a structured lexical corpus through OCR, deterministic parsing, reconstruction, and AI-assisted editorial workflows. That corpus now powers kimbundu.org.
Why this project exists
Kimbundu is one of the major Bantu languages of Angola, spoken by millions of people. Despite that, it's barely documented in digital form. The best lexical references exist as scanned books or old PDFs that no search engine can read and no language tool can use.
I wanted to take one of the most complete surviving Kimbundu–Portuguese dictionaries, digitise it properly, and publish it as a structured, searchable public resource.
This is also personal. I’m of Angolan heritage, and Kimbundu matters to my family and community. As large language models increasingly shape how people find knowledge, languages that are not digitised risk becoming invisible in the systems that will define future access to culture and history.
The problem
The source material is a historical printed dictionary, hundreds of scanned pages in a dense two-column layout. Getting clean structured data out of it is hard:
- Two-column page layouts that need accurate segmentation before OCR can even start
- OCR noise from aged print, uneven scanning, and non-Latin diacritics
- Dense abbreviations, noun-class markers, and grammatical annotations crammed into compact entries
- Line, column, and page boundary issues that break naive text extraction
- Every entry needs to carry provenance back to its source page and position
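The column problem gives a flavour of the work. The write-up doesn't include the actual segmentation code, but as a minimal sketch of one common approach (a vertical projection profile, assumed here for illustration), a two-column scan can be split by finding the whitespace gutter near the middle of the page:

```python
import numpy as np

def find_gutter(page: np.ndarray, band: float = 0.2) -> int:
    """Locate the column gutter of a two-column page scan.

    page: 2-D array of ink density (0 = white, 1 = black).
    Searches the central `band` of the page width for the
    x-position with the least ink, i.e. the whitespace gutter.
    """
    ink_per_column = page.sum(axis=0)
    w = page.shape[1]
    lo = int(w * (0.5 - band / 2))
    hi = int(w * (0.5 + band / 2))
    return lo + int(np.argmin(ink_per_column[lo:hi]))

def split_columns(page: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a page image into left and right columns at the gutter."""
    g = find_gutter(page)
    return page[:, :g], page[:, g:]
```

On clean scans this is often enough; skewed or warped pages need deskewing first, which is one reason each stage's output had to stay inspectable.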
The approach
I designed the pipeline as a series of explicit, auditable stages. Each step produces inspectable outputs, which made it possible to debug OCR issues, reconstruction errors, and editorial decisions without hiding uncertainty inside a single opaque process.
1. Historical PDF
2. Page extraction
3. Column segmentation
4. OCR capture
5. Deterministic parsing
6. Corpus reconstruction
7. Cleanup
8. Conservative LLM audit
9. Editorial merge
10. Public website dataset
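The "explicit, auditable stages" idea can be sketched as a runner that persists every stage's output before the next stage consumes it. The stage names, record shapes, and parsing rule below are hypothetical, not the project's actual code:

```python
import json
from pathlib import Path

def run_stage(name: str, fn, data, out_dir: Path):
    """Run one pipeline stage and persist its output as JSON,
    so intermediate results can be inspected and debugged."""
    result = fn(data)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.json"
    out_path.write_text(json.dumps(result, ensure_ascii=False, indent=2))
    return result

def parse_entries(lines):
    """Illustrative deterministic parse: headword before the first
    comma, remainder as the gloss, provenance carried through."""
    entries = []
    for rec in lines:
        head, _, gloss = rec["text"].partition(",")
        entries.append({
            "headword": head.strip(),
            "gloss": gloss.strip(),
            "source": {"page": rec["page"], "line": rec["line"]},
        })
    return entries
```

Because every stage writes a file, a bad entry on the website can be traced backwards stage by stage until the step that introduced the error is found.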
What I built
The pipeline produced:
- A structured lexical corpus with headwords, grammatical metadata, source-derived definitions, cross-references, and provenance per entry
- Full provenance tracking, so every entry traces back to its original page, column, and line
- A conservative AI auditing layer (Ollama and OpenAI) that proposes corrections without overwriting source data, with all suggestions tracked and reviewable
- An editorial merge workflow that reconciles deterministic parsing with LLM suggestions under human review
- The public dictionary dataset behind kimbundu.org
- A Next.js site for searching, browsing, and exploring the full corpus
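The key design constraint was that the LLM never writes to source data directly. As an illustration of that shape (the field names and schema here are invented for the example, not the site's actual data model), proposals are stored alongside each entry and applied only when a human editor accepts them:

```python
from dataclasses import dataclass, field

@dataclass
class Provenance:
    page: int
    column: int
    line: int

@dataclass
class Entry:
    headword: str
    definition: str
    provenance: Provenance
    # Audit proposals accumulate here; they are never auto-applied.
    suggestions: list = field(default_factory=list)

def propose(entry: Entry, field_name: str, value: str, model: str) -> None:
    """Record an LLM correction as a reviewable suggestion.
    The entry's source-derived fields stay untouched."""
    entry.suggestions.append({
        "field": field_name,
        "proposed": value,
        "model": model,
        "status": "pending",
    })

def accept(entry: Entry, index: int) -> None:
    """Editorial merge: a human explicitly accepts one suggestion,
    which is only then applied to the entry."""
    s = entry.suggestions[index]
    setattr(entry, s["field"], s["proposed"])
    s["status"] = "accepted"
```

Keeping suggestions as data rather than edits means every change remains attributable: the corpus records what the source said, what the model proposed, and what an editor decided.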
Why it matters
This isn't just engineering for the sake of it. Languages that haven't been digitised don't show up in the tools people actually use to find information. AI trains on what's already digital. What isn't there won't be reflected.
High-quality digital cultural libraries are infrastructure. They give learners, researchers, and future language tools something reliable to build on, rather than forcing underdocumented languages to remain absent from the systems that increasingly mediate knowledge.
Kimbundu.org is a small but real contribution to that.
What comes next
There's more I want to do with this:
- A modern Portuguese layer, showing contemporary translations alongside the original archaic ones
- English and French translations to make the dictionary more widely accessible
- Grammar resources: verb tables, noun-class guides, usage patterns
- Cultural content like stories, songs, proverbs, and Bible texts in Kimbundu
- Eventually, a broader digital cultural library that goes beyond just the dictionary