Projects

Things I've built, or am still building.

Next.js · TypeScript · Python · Tesseract OCR · Ollama · OpenAI API · Tailwind CSS · Vercel · JSON Corpus Engineering

Built a digital language-preservation platform for Kimbundu by turning a scanned historical Kimbundu–Portuguese dictionary into a structured, auditable lexical corpus. Designed a multi-stage pipeline covering PDF page extraction, column segmentation, OCR capture, deterministic parsing, corpus reconstruction, conservative LLM auditing, and editorial merge workflows. The final merged corpus holds 10,679 entries with provenance and review tracking, and is published as a website-ready public dataset that powers kimbundu.org.

  • Built a multi-stage OCR → corpus reconstruction pipeline for a historical dictionary spanning hundreds of scanned pages
  • Produced a final merged editorial corpus of 10,679 entries with provenance, cleanup metadata, and review workflows
  • Published a public dictionary dataset and website experience to support Kimbundu language preservation
Read case study →
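
To give a feel for the deterministic parsing stage, here is a minimal sketch, not the project's actual code. It assumes each OCR line looks like `headword, abbrev. gloss`, which is a guess at the layout. The `Entry` fields, the `parse_column` helper, and the sample strings are all hypothetical placeholders. The idea it illustrates is that every parsed entry keeps its page, column, and line so it can be traced back to the scan, and anything that does not match is set aside for review instead of being guessed at.

```python
import re
from dataclasses import dataclass

# Assumed line shape: "headword, abbrev. gloss". The real dictionary layout may differ.
ENTRY_RE = re.compile(r"^(?P<headword>[^,]+),\s*(?P<pos>[a-z]+\.)\s+(?P<gloss>.+)$")


@dataclass
class Entry:
    headword: str
    pos: str
    gloss: str
    # Provenance: where in the scan this entry came from.
    page: int
    column: int
    line: int
    review_status: str = "unreviewed"


def parse_column(lines, page, column):
    """Parse the OCR lines of one column into entries plus a list of rejects."""
    entries, rejects = [], []
    for line_no, raw in enumerate(lines, start=1):
        text = " ".join(raw.split())  # collapse runs of whitespace left by OCR
        if not text:
            continue  # blank lines are skipped, but line numbering is preserved
        match = ENTRY_RE.match(text)
        if match is None:
            # Never guess: keep the raw text and its location for manual review.
            rejects.append((page, column, line_no, text))
            continue
        entries.append(
            Entry(
                headword=match["headword"].strip(),
                pos=match["pos"],
                gloss=match["gloss"].strip(),
                page=page,
                column=column,
                line=line_no,
            )
        )
    return entries, rejects


# Placeholder input, not real dictionary content.
sample = ["alpha, s. primeiro exemplo", "", "beta,  v.  segundo   exemplo", "### ilegivel"]
entries, rejects = parse_column(sample, page=12, column=2)
```

Because the parser is a plain regular expression with no model in the loop, running it twice on the same OCR output gives the same entries, which is what makes a later audit or editorial merge step checkable against the source pages.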