
Applying “Programming Without Pointers” to an mbox indexer using Zig

Zig lets you process a 5GB mbox file (tens of thousands of emails) without a heap allocation per message.

I applied Andrew Kelley’s “Programming Without Pointers” (PWP) technique to build an indexer that extracts Message-IDs from Hey’s export, flags duplicates against a prior Gmail import, and skips them. This avoided an 8-hour reimport via GYB (Got Your Back), which users report creates dupes due to propagation delays and incomplete filtering.

The migration exposed Hey’s limits: no API, poor integrations. Gmail offers full API access and ecosystem tools. Hey’s 5GB mbox dump arrived as plain text, a format dating to 1970s Unix but still used by Google Takeout. Each message starts with a “From ” line, followed by headers and body until the next “From ” line or EOF. Careful: body lines that begin with “From ” are usually escaped as “>From ”. Attachments are embedded as base64, but for deduping only the Message-ID matters, a header that RFC 5322 defines as globally unique.

Why Avoid Allocations?

Allocations kill performance on large files. They fragment memory, cause cache misses, and, in garbage-collected languages like Go, trigger collector pauses. Kelley’s PWP talk (watch it: 30 minutes on YouTube) shows how Zig’s comptime and slices let you treat input as a single buffer. Parse by finding offsets: no copies, no pointers beyond slices. For 5GB, loading into RAM (feasible on 16GB+ machines) yields O(n) time, dominated by I/O and string searches.
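To make "parse by finding offsets" concrete, here is a minimal sketch of storing a Message-ID as an offset and length into the one big buffer instead of as a pointer. The Span type and its method names are my own illustration, not something from the talk or from std, and it assumes a 64-bit target and a recent Zig release (0.11 or later).

```zig
const std = @import("std");

/// Hypothetical handle: a position in the input buffer, not a pointer.
const Span = struct {
    start: u64, // byte offset; 64 bits because the file exceeds 4 GiB
    len: u32, // header values are short, so 32 bits is plenty

    /// `part` must be a sub-slice of `buf`.
    fn fromSlice(buf: []const u8, part: []const u8) Span {
        return .{
            .start = @intFromPtr(part.ptr) - @intFromPtr(buf.ptr),
            .len = @intCast(part.len),
        };
    }

    /// Turn the handle back into a slice of whatever buffer holds the file.
    fn slice(self: Span, buf: []const u8) []const u8 {
        const start: usize = @intCast(self.start);
        return buf[start..][0..self.len];
    }
};

test "span usage" {
    const buf = "Message-ID: <a@example.com>\n";
    const span = Span.fromSlice(buf, buf[12..27]);
    try std.testing.expectEqualStrings("<a@example.com>", span.slice(buf));
}
```

Because a Span holds no address, it stays valid if the file is remapped at a different location, and an array of them can be written to disk as an index.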

Skeptical note: PWP adds upfront code complexity. Zig’s type system forces you to own buffers explicitly, unlike C’s wild pointers or Rust’s lifetimes. But patterns emerge: load the file into a []const u8, find “From ” delimiters via std.mem.indexOf, slice out the headers, and hunt for “Message-ID:” with linear scans or Boyer-Moore for speed. Once fluent, it’s simpler than chasing leaks or GC tuning.
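The header hunt can be sketched as a line-by-line linear scan. The function name findMessageId is hypothetical, and the sketch assumes it is handed only the header block of one message. It does not handle folded headers, where the value sits on a continuation line.

```zig
const std = @import("std");

/// Returns the Message-ID value as a sub-slice of `headers`,
/// or null when the header is absent. Nothing is copied.
fn findMessageId(headers: []const u8) ?[]const u8 {
    const name = "message-id:";
    var lines = std.mem.splitScalar(u8, headers, '\n');
    while (lines.next()) |line| {
        if (line.len < name.len) continue;
        // Header field names are case-insensitive ("Message-Id" also occurs).
        if (!std.ascii.eqlIgnoreCase(line[0..name.len], name)) continue;
        // Strip surrounding spaces, tabs, and a trailing '\r' from the value.
        return std.mem.trim(u8, line[name.len..], " \t\r");
    }
    return null;
}

test "findMessageId usage" {
    const headers = "Subject: hi\nMessage-ID: <1@example.com>\n";
    try std.testing.expectEqualStrings("<1@example.com>", findMessageId(headers).?);
}
```

Matching only at the start of a line avoids false hits on headers such as Resent-Message-ID, which a plain substring search would also match.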

Zig Implementation Breakdown

Start with std.fs.Dir.readFileAlloc, File.readToEndAlloc, or mmap to get the whole file into one buffer. That buffer is the only large allocation; no arena allocator is needed here. Core loop:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const file = try std.fs.cwd().openFile("export.mbox", .{});
    defer file.close();
    const content = try file.readToEndAlloc(allocator, 5 * 1024 * 1024 * 1024); // 5GB cap
    defer allocator.free(content);

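    // MboxIterator (defined separately) walks the buffer message by message.
    var it = MboxIterator.init(content);
    // StringHashMap hashes the slice bytes; AutoHashMap rejects slice keys at compile time.
    var seen = std.StringHashMap(void).init(allocator);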
    defer seen.deinit();

    while (it.next()) |msg| {
        if (seen.contains(msg.message_id)) continue;
        try seen.put(msg.message_id, {});
        // Process or log
    }
}

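The MboxIterator struct holds the buffer and a cursor. next() scans for the next “From ” line (memcmp or std.mem.indexOfPos), slices up to it, and parses headers by the ‘:’ delimiter. To extract the Message-ID: pos = mem.indexOf(u8, slice, "Message-ID:") orelse continue; then trim to the end of the line (\n, dropping a trailing \r if present). Header names are case-insensitive, so “Message-Id:” should match too. Parsing itself allocates nothing, because Message-ID slices reference the buffer. The hash map keys are those same slices, which Zig allows as long as the buffer outlives the map; use std.StringHashMap so the bytes are hashed.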
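A minimal sketch of what such an iterator could look like follows. The Message type and its field names are assumptions chosen to match the loop above, not the code I actually ran, and the exact-case substring search mirrors the description rather than handling every header variant. It assumes Zig 0.11 or later.

```zig
const std = @import("std");
const mem = std.mem;

const Message = struct {
    raw: []const u8, // whole message, starting at its "From " line
    message_id: []const u8, // sub-slice of `raw`
};

const MboxIterator = struct {
    buf: []const u8,
    pos: usize = 0,

    fn init(buf: []const u8) MboxIterator {
        return .{ .buf = buf };
    }

    /// Returns the next message that has a Message-ID; others are skipped.
    fn next(self: *MboxIterator) ?Message {
        const key = "Message-ID:";
        while (self.pos < self.buf.len) {
            const start = self.pos;
            // The next message begins at the next "From " that follows a newline.
            const end = if (mem.indexOfPos(u8, self.buf, start + 1, "\nFrom ")) |i| i + 1 else self.buf.len;
            self.pos = end;
            const raw = self.buf[start..end];

            // Headers stop at the first blank line (LF or CRLF endings).
            const lf = mem.indexOf(u8, raw, "\n\n") orelse raw.len;
            const crlf = mem.indexOf(u8, raw, "\n\r\n") orelse raw.len;
            const headers = raw[0..@min(lf, crlf)];

            const k = mem.indexOf(u8, headers, key) orelse continue;
            const value_start = k + key.len;
            const eol = mem.indexOfScalarPos(u8, headers, value_start, '\n') orelse headers.len;
            const id = mem.trim(u8, headers[value_start..eol], " \t\r");
            return Message{ .raw = raw, .message_id = id };
        }
        return null;
    }
};

test "MboxIterator usage" {
    var it = MboxIterator.init("From a@example.com\nMessage-ID: <1@example.com>\n\nhello\n");
    try std.testing.expectEqualStrings("<1@example.com>", it.next().?.message_id);
    try std.testing.expect(it.next() == null);
}
```

Every slice returned points into the original buffer, so the iterator itself never touches an allocator.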

Edge cases: “From ” in bodies (RFC 4155 describes > escaping as common practice rather than a requirement, so verify what your exporter does). Multi-GB files stress I/O; Zig’s async I/O shines here, but sync sufficed. Ran on an M1 Mac: parsed 5GB in 45 seconds and found 2.1% dupes (propagation lag). The GYB upload, by comparison, took 8 hours.
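One common heuristic for the “From ” edge case is to accept a separator only at the start of the buffer or directly after a blank line. The sketch below shows that check; isSeparator is a hypothetical helper, and the blank-line rule is an assumption about the exporter, not something every mbox writer guarantees.

```zig
const std = @import("std");

/// True when the "From " at `pos` looks like a real message separator:
/// it sits at the start of the buffer or follows a blank line.
/// Escaped body lines (">From ...") fail because '>' precedes the 'F'.
fn isSeparator(buf: []const u8, pos: usize) bool {
    if (!std.mem.startsWith(u8, buf[pos..], "From ")) return false;
    if (pos == 0) return true;
    if (buf[pos - 1] != '\n') return false; // not at the start of a line
    if (pos == 1) return true; // the only preceding line is empty
    // The preceding line must be empty: "\n\n" or "\n\r\n" before `pos`.
    if (buf[pos - 2] == '\n') return true;
    return pos >= 3 and buf[pos - 2] == '\r' and buf[pos - 3] == '\n';
}

test "isSeparator usage" {
    const buf = "From a@example.com\n\nhi\nFrom the body\n";
    try std.testing.expect(isSeparator(buf, 0));
    const in_body = std.mem.indexOf(u8, buf, "From the").?;
    try std.testing.expect(!isSeparator(buf, in_body));
}
```

If the exporter escapes reliably, the simpler line-start check is enough. If it does not, this stricter test trades a few missed boundaries for fewer messages split in half.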

Implications for Data-Heavy Tasks

This matters beyond email. Mbox-like formats haunt logs, blockchain dumps, and packet captures. PWP scales to TBs on SSDs: zero-copy parsing feeds indexes or dedup sets. In crypto and security, it applies to blockchain data (Bitcoin’s full chain is 500GB+) and threat intel feeds. In finance, it applies to trade logs, without alloc storms.

Fair critique: Zig’s young ecosystem lacks battle-tested libs (no production mbox parser yet). Manual parsing risks bugs, so fuzz the parser (recent Zig releases integrate fuzzing into zig build test via --fuzz). But for one-offs, it crushes scripts in Python (allocs galore) or even Rust (borrow-checker fights). Tradeoff: 200 LOC vs. 20 with libs, but 100x speed/memory win.

Bottom line: PWP makes the case for Zig in perf-critical data crunching. Watch the talk, clone the repo if I publish it, and tweak it for your 10GB dumps. Migrations hurt less when you control the stack.

April 8, 2026 · 4 min · 12 views · Source: Lobsters
