
Okapi, or “What if ripgrep Could Edit?”

Okapi combines ripgrep’s precise regex searches with direct editing in a text editor. Its developer built it to fix OCR scanning errors across tens of thousands of text files from the US Official Register project. The dataset covers over 100 volumes and roughly 150 years of federal employee records, from 1816 to around 1959. Each volume lists names, positions, salaries, birthplaces, and appointments: prime material for historians and genealogists, but riddled with OCR glitches.

The Official Register, published regularly by the US government, documents federal workforce details such as Richard G’s role as a Painter for the Isthmian Canal Commission at 68¢ per hour in 1909. Poor scan quality on aged paper reduces letters to vertical strokes, so OCR often misreads “Ill” (Illinois) as “III” or mangles “LaSalle” into “IIIllaSalle”. Examples abound: “RosedaleIII” should read “RosedaleIll”, and “RockIslandArsnlIII” should read “RockIslandArsnlIll”. Blind find-and-replace fails here: it fixes some errors, hides others, and creates new ones.
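To make the failure mode concrete, here is a minimal Python sketch. The three sample lines are hypothetical (not taken from the Register), and the "narrower rule" is only one possible heuristic, not something Okapi itself applies.

```python
import re

# Hypothetical lines: two OCR errors and one genuine generational suffix.
lines = [
    "Smith, John, Clerk, RosedaleIII",
    "Brown, Henry III, Messenger, Ohio",
    "Jones, Mary, Copyist, RockIslandArsnlIII",
]

# A blind replacement rewrites every occurrence, including the real "III".
blind = [line.replace("III", "Ill") for line in lines]

# A narrower rule: only touch "III" that is glued to a preceding lowercase
# letter and sits at the end of a word.
glued = re.compile(r"(?<=[a-z])III\b")
targeted = [glued.sub("Ill", line) for line in lines]

for before, b, t in zip(lines, blind, targeted):
    print(f"{before!r}\n  blind:    {b!r}\n  targeted: {t!r}")
```

Even the narrower rule is a guess: it would not repair a mangled token like “IIIllaSalle”, which is why reviewing each match in context still matters.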

OCR Challenges in Large-Scale Digitization

Standard Tesseract OCR struggles with these scans. The developer switched to olmOCR, a fine-tuned model that delivers higher accuracy on historical print. Still, scannos (scanning typos) persist at scale: tens of thousands of pages mean hundreds of matches per pattern across files. Ripgrep excels at spotting them fast. Its Rust implementation searches gigabytes in seconds using regexes such as "III" or "Dan[^l ]\b", which flags likely misreadings of “Danl”. But ripgrep only finds; it doesn’t edit.
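As a quick check of what a pattern like "Dan[^l ]\b" actually selects, the sketch below runs it through Python's `re` module over a few hypothetical tokens. Ripgrep uses Rust's regex engine rather than Python's, but for a plain ASCII pattern like this one the two should behave the same.

```python
import re

# "Dan", then one character that is neither "l" nor a space,
# then a word boundary (so the token ends right there).
pattern = re.compile(r"Dan[^l ]\b")

# Hypothetical tokens: the correct abbreviation, three likely misreads,
# a full name, and a name followed by a space.
samples = ["Danl", "Dani", "DanI", "Dan1", "Daniel", "Dan Smith"]

hits = [s for s in samples if pattern.search(s)]
print(hits)
```

The correct “Danl” is skipped because of the excluded "l", and “Daniel” is skipped because no word boundary follows its fourth letter; only the four-character look-alikes are flagged.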

Opening hundreds of files by hand wastes hours. Scripted replacements risk over-correction, especially where a legitimate “III” appears, such as the rare third-generation name. Okapi bridges the gap: it runs ripgrep under the hood, then loads the matching lines into an editable buffer in your text editor. You review each match in context, use multi-select to fix precise subsets, and save; the changes are written back to the original files.
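The search, edit, and write-back loop can be sketched in a few lines of Python. This is an illustration of the general idea only, not Okapi's implementation: the helper names `collect` and `write_back`, the `*.txt` filter, and the sample files are all assumptions, and a list comprehension stands in for the human edit.

```python
import re
import tempfile
from pathlib import Path


def collect(root, pattern):
    """Return (path, line_index, text) for every line matching pattern."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        for i, text in enumerate(path.read_text(encoding="utf-8").splitlines()):
            if rx.search(text):
                hits.append((path, i, text))
    return hits


def write_back(hits, edited):
    """Replace each matched line with its edited counterpart, file by file."""
    if len(hits) != len(edited):
        raise ValueError("edited buffer must keep one line per match")
    by_file = {}
    for (path, i, _), new_text in zip(hits, edited):
        by_file.setdefault(path, {})[i] = new_text
    for path, changes in by_file.items():
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, new_text in changes.items():
            lines[i] = new_text
        # Simplification: always ends the file with a single newline.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Demo on a throwaway directory with two hypothetical files.
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("Clerk, RosedaleIII\nMessenger, Ohio\n", encoding="utf-8")
(root / "b.txt").write_text("Brown, Henry III\nCopyist, ChicagoIII\n", encoding="utf-8")

hits = collect(root, r"III")
# Stand-in for the human edit: fix glued "III", leave "Henry III" alone.
edited = [re.sub(r"(?<=[a-z])III\b", "Ill", text) for _, _, text in hits]
write_back(hits, edited)
```

Keeping the file path and line index alongside each buffered line is what lets edits made in one place land in many files.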

How Okapi Works

Install via Homebrew:

brew install okapi

Basic use searches recursively:

$ okapi III

This pulls every line containing “III” and shows 15 characters of context before and after each match (the -c option, as in -c ..15, adjusts this). You can exclude patterns with -e "Michel" or narrow the search with a regex such as "Mich\wl". The buffer displays like:

A 76 ▓ — Richd G, IsthCnlCmsn Pntr $0.65p

Vertical bars mark the editable sections. In Sublime Text, grab instances with multi-select, type the replacement, and hit save. The asciinema demos show it handling bulk edits smoothly. Vim and Emacs users can adapt the workflow via stdin/stdout piping; the tool is editor-agnostic at its core.
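The idea of a fixed context window plus an exclusion pattern can be illustrated in Python. This is a sketch under assumptions: the function name `snippets`, its parameters, and the sample line are hypothetical, and it does not reproduce how Okapi parses its -c or -e options.

```python
import re


def snippets(line, pattern, context=15, exclude=None):
    """Return each match with up to `context` characters on either side."""
    # Skip the whole line if it matches the exclusion pattern.
    if exclude and re.search(exclude, line):
        return []
    out = []
    for m in re.finditer(pattern, line):
        start = max(0, m.start() - context)
        end = min(len(line), m.end() + context)
        out.append(line[start:end])
    return out


# Hypothetical register line.
line = "Brown, Saml, Clerk, Treasury Dept, RosedaleIII, 1200 per annum"

print(snippets(line, r"III"))                       # default 15-character window
print(snippets(line, r"III", context=5))            # tighter window
print(snippets(line, r"III", exclude=r"Treasury"))  # line filtered out
```

A narrow window keeps the buffer compact enough to scan hundreds of matches at a glance while still showing enough surrounding text to judge each one.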

Ripgrep’s speed matters: across 10,000+ files totaling gigabytes, it scans and matches in seconds, with no index to build first. Okapi adds minimal overhead. The source is on GitHub, open to forks for custom buffers or integrations.

Implications for Data Integrity

Accurate digitization unlocks this archive for search. Unfixed scannos poison downstream uses: genealogy sites propagate “RosedaleIII” as a surname; historians misread districts as “IIIpr”. Tools like Okapi scale human review, catching nuances scripts miss. It beats jq or sed pipelines for interactive fixes.

Skepticism is warranted: Okapi is niche, tied to line-based files and regex fluency. There is no GUI, so comfort with the command line is required. Non-text sources (PDFs, images) must go through an OCR pipeline first. Yet for cleaned extracts like these, it shines. The broader lesson: hybrid search-and-edit tools help prevent garbage-in, garbage-out in AI-era data projects. Expect similar utilities for log analysis or config sweeps, where ripgrep’s precision combined with an editor’s power could scale to enterprise corpora.

Projects like the Official Register highlight public-domain goldmines. The US government released the scans via HathiTrust, and community OCR fills the gaps. Okapi accelerates the cleanup, helping facts like hires in the 21st Illinois district survive correctly. Download it and test it on your own mess; it might just save your next bulk edit.

March 30, 2026 · 3 min · 8 views · Source: Lobsters
