Things you didn’t know about indexes

Database indexes promise lightning-fast lookups, but they come with hidden costs that can tank your app's performance if mishandled.

On a table with 10 million rows, a full table scan can chew through seconds of CPU and I/O. An index drops that lookup to milliseconds by searching a sorted structure. Yet every insert, update, or delete pays a price: the index must be updated, storage bloats, and the query planner has more options to weigh. Blindly indexing every column leads to write bottlenecks and skyrocketing cloud bills; indexes can double or triple your database's footprint.

Consider a Pokémon table with 1,000 rows and columns like id, name, type_1, generation, and base_attack. Without an index, finding Pikachu forces the database to scan every row sequentially: a full table scan. Postgres, MySQL, and most other relational databases degrade the same way at scale. Expect linear time, O(n): harmless at 1,000 rows, but a disaster for user-facing queries at 10 million.

 id | name       | type_1   | generation | base_attack
----+------------+----------+------------+-------------
  1 | Bulbasaur  | Grass    | 1          | 49
  4 | Charmander | Fire     | 1          | 52
 25 | Pikachu    | Electric | 1          | 55
150 | Mewtwo     | Psychic  | 1          | 110

Index the name column, and Postgres builds a B-tree: a self-balancing tree structure that keeps names sorted. Lookup drops to O(log n); on 10 million rows, that's ~23 steps instead of 10 million. Each index entry points to its row, so fetching the data takes one extra hop. Result: queries fly. But now every Pokémon insert must also update the B-tree, splitting and rebalancing nodes as needed. Multiply that by several indexes, and writes crawl.
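
The before-and-after is easy to see in miniature with SQLite, whose EXPLAIN QUERY PLAN plays the role of Postgres's EXPLAIN (the table and index names here are illustrative):

```python
import sqlite3

# In-memory stand-in for the Pokémon table above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pokemon (
    id INTEGER PRIMARY KEY, name TEXT, type_1 TEXT,
    generation INTEGER, base_attack INTEGER)""")
conn.executemany("INSERT INTO pokemon VALUES (?, ?, ?, ?, ?)", [
    (1, "Bulbasaur", "Grass", 1, 49),
    (4, "Charmander", "Fire", 1, 52),
    (25, "Pikachu", "Electric", 1, 55),
    (150, "Mewtwo", "Psychic", 1, 110)])

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output describes the access path.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

query = "SELECT id FROM pokemon WHERE name = 'Pikachu'"
before = plan(query)  # full table scan
conn.execute("CREATE INDEX idx_pokemon_name ON pokemon(name)")
after = plan(query)   # B-tree search via idx_pokemon_name
print(before)
print(after)
```

The same experiment on Postgres is `EXPLAIN SELECT ...` before and after `CREATE INDEX`; watch for `Seq Scan` turning into `Index Scan`.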

Indexing Trade-offs: Measure Before You Leap

Reads accelerate; writes decelerate. A single index adds ~10-20% overhead per write operation in benchmarks. Stack eight indexes? Writes slow 2-5x, per PostgreSQL tests on TPC-H workloads. Storage balloons: indexes duplicate column data (keys plus pointers), often 30-50% of table size each. On a 100GB table, that’s gigabytes extra, hitting your AWS RDS bill hard—storage costs ~$0.10/GB-month, I/O another $0.20 per million requests.
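
A rough way to feel the write penalty yourself (a sketch only; absolute numbers vary wildly by engine, schema, and hardware, and an in-memory SQLite database stands in for Postgres here):

```python
import sqlite3
import time

def insert_time(n_indexes, rows=20_000):
    # Time a bulk insert into a table carrying n_indexes single-column indexes.
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"c{i} INTEGER" for i in range(8))
    conn.execute(f"CREATE TABLE t (id INTEGER PRIMARY KEY, {cols})")
    for i in range(n_indexes):
        conn.execute(f"CREATE INDEX idx_{i} ON t(c{i})")
    data = [(i,) * 9 for i in range(rows)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", data)
    conn.commit()
    return time.perf_counter() - start

t0 = insert_time(0)
t8 = insert_time(8)
print(f"no indexes: {t0:.3f}s  eight indexes: {t8:.3f}s  ({t8 / t0:.1f}x slower)")
```

Every insert now maintains nine B-trees (the table plus eight indexes) instead of one, which is exactly the multiplier the benchmarks above describe.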

Cache pressure mounts. Databases preload hot indexes into RAM. One table? Simple. Nine structures? Evictions spike, forcing disk reads. Query planning worsens too: Postgres evaluates plans combinatorially. Five indexes mean dozens of options; planning time can hit 10-50ms on complex queries, dwarfing execution.

Real-world implication: High-write apps like user signups or logs suffer. Netflix saw index bloat cause 30% latency spikes; they pruned ruthlessly. Test with EXPLAIN ANALYZE in Postgres—watch seq scans versus index scans. If writes dominate (e.g., IoT streams), skip indexes on volatile columns like timestamps.

Why Your Index Fails: Pitfalls and Fixes

You indexed name, ran your query, and it's still slow? Common traps abound. First, composite indexes demand prefix order. An index on (type_1, generation) speeds up WHERE type_1 = 'Grass' AND generation = 1, but not WHERE generation = 1 alone: without the leftmost column type_1 in the filter, the planner ignores the index. Reverse the column order and the situation flips. Always filter on the leftmost prefix.
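
A minimal sketch of the leftmost-prefix rule, again using SQLite's EXPLAIN QUERY PLAN as a stand-in for Postgres's EXPLAIN (the index name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pokemon (
    id INTEGER PRIMARY KEY, name TEXT, type_1 TEXT, generation INTEGER)""")
# Composite index: type_1 is the leftmost column.
conn.execute("CREATE INDEX idx_type_gen ON pokemon(type_1, generation)")

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

both = plan("SELECT * FROM pokemon WHERE type_1 = 'Grass' AND generation = 1")
prefix = plan("SELECT * FROM pokemon WHERE type_1 = 'Grass'")
gen_only = plan("SELECT * FROM pokemon WHERE generation = 1")
print(both)      # uses idx_type_gen on both columns
print(prefix)    # still uses idx_type_gen: the leftmost column is filtered
print(gen_only)  # full table scan: the leftmost column is missing from the filter
```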

Selectivity kills indexes. Index a boolean like is_legendary? Half the rows match true/false—stats show low cardinality, so planner picks full scan. Aim for indexes where top values hit <1% of rows. Gender columns? Useless. Use pg_stats or INFORMATION_SCHEMA.STATISTICS to check.
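
Selectivity is easy to estimate yourself: the fraction of the table matched by a column's most common value, roughly what Postgres keeps in pg_stats.most_common_freqs. A small sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pokemon (name TEXT, is_legendary INTEGER)")
# 1,000 made-up rows: unique names, a 50/50 boolean flag.
conn.executemany("INSERT INTO pokemon VALUES (?, ?)",
                 [(f"mon_{i}", i % 2) for i in range(1000)])

def top_value_fraction(col):
    # Rows matched by the column's most common value, as a fraction of the table.
    total = conn.execute("SELECT COUNT(*) FROM pokemon").fetchone()[0]
    top = conn.execute(
        f"SELECT COUNT(*) FROM pokemon GROUP BY {col} "
        "ORDER BY COUNT(*) DESC LIMIT 1").fetchone()[0]
    return top / total

print(top_value_fraction("is_legendary"))  # 0.5 -- half the table; index useless
print(top_value_fraction("name"))          # 0.001 -- well under 1%; index pays off
```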

Functions and expressions break them. WHERE UPPER(name) = 'PIKACHU' scans the whole table; either index the expression UPPER(name) or store and query the name consistently. LIKE with a leading '%' is just as bad: WHERE name LIKE '%chu' forces a scan, while a trailing '%' (name LIKE 'Pika%') can use the index. NULLs deserve care too: Postgres B-trees do index NULLs (IS NULL can use an index), but on mostly-NULL columns a partial index with WHERE col IS NOT NULL keeps the index lean.
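
Expression indexes fix the function trap; both Postgres and SQLite support them. A sketch (index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pokemon (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_name ON pokemon(name)")

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

query = "SELECT * FROM pokemon WHERE UPPER(name) = 'PIKACHU'"
hidden = plan(query)   # the function call hides name from idx_name: full scan
conn.execute("CREATE INDEX idx_name_upper ON pokemon(UPPER(name))")
matched = plan(query)  # the expression index matches the predicate
print(hidden)
print(matched)
```

In Postgres the equivalent is `CREATE INDEX ON pokemon (UPPER(name));`, and the planner uses it only when the query repeats the exact expression.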

Postgres's MVCC adds bloat: indexes keep entries for dead tuples until VACUUM cleans them up. Neglect maintenance, and index sizes can explode 2-10x. Run VACUUM and ANALYZE regularly; autovacuum helps, but its defaults tune poorly under heavy load.

Why this matters: Bad indexes hide in dev but explode in prod. A fintech app querying trades by user_id? Index it. But log tables by timestamp? Partition instead. Tools like pgBadger or EXPLAIN visualize pain. In cloud, monitor IOPS and CPU—spikes signal index woes. Skeptically audit: drop unused indexes (pg_stat_user_indexes shows zero-scan ones). Net result? 20-50% perf gains, halved storage, leaner ops.

Bottom line: Indexes are tools, not magic. Profile first, index surgically, monitor relentlessly. Scale to billions of rows without melting your infra.

April 15, 2026 · 4 min · 5 views · Source: Lobsters
