Show HN: Postgres extension for BM25 relevance-ranked full-text search

Tiger Data (closely tied to Timescale) just released pg_textsearch v1.0, a Postgres extension that implements BM25 full-text search with relevance ranking.

It runs natively in Postgres under the permissive PostgreSQL license, dodging the AGPL restrictions of the leading alternative, ParadeDB. Benchmarks on the MS-MARCO dataset show it delivering 4.7x higher query throughput at scale than ParadeDB’s Tantivy backend.

This matters because Postgres users building AI workloads—especially retrieval-augmented generation (RAG) pipelines—need hybrid search: keyword matching via BM25 combined with semantic search from pgvector. Core Postgres full-text search lacks proper ranking and scales poorly for large indexes. ParadeDB fills that gap but ties you to AGPL, complicating commercial use. pg_textsearch removes that hurdle, and its performance edge suggests it’s not just viable but superior for high-throughput scenarios.

Background and Development

Tiger Data specializes in Postgres for time-series data via TimescaleDB. Last summer, they pushed into AI-centric apps, building pgvectorscale to extend pgvector beyond RAM limits for vector search. They needed a matching lexical search component. Instead of licensing ParadeDB or spinning up Elasticsearch, lead developer TJ Saunders—with 25 years in database internals—built pg_textsearch mostly solo.

Initial estimate: one quarter, using AI coding tools like Anthropic’s Claude models (including Opus). Reality: two quarters, helped along by community contributions after early open-sourcing. The result integrates Tantivy-style BM25 ranking directly into Postgres, handling tokenization, indexing, and ranked queries without external dependencies. Source code, benchmarks, and setup scripts sit on GitHub. Full details are in their blog post.

BM25, for context, is the gold standard in information retrieval: a probabilistic ranking model that scores documents by term frequency, inverse document frequency, and document-length normalization. It outperforms simpler TF-IDF and powers tools like Elasticsearch and Solr. For comparison, core Postgres’s built-in ranking looks like this in SQL:

-- Core Postgres full-text search with cover-density ranking
SELECT id, ts_rank_cd(textsearch, q) AS rank
FROM documents,
     websearch_to_tsquery('english', 'hybrid search') AS q
WHERE textsearch @@ q
ORDER BY rank DESC
LIMIT 10;

pg_textsearch extends this with scalable BM25 scoring, supporting multi-language tokenizers and dynamic updates.
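For reference, the standard Okapi BM25 ranking function scores a document D against a query Q as:

\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the term frequency of q_i in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 (typically 1.2-2.0) and b (typically 0.75) tune term-frequency saturation and length normalization.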

Benchmarks and Performance Claims

They tested on MS-MARCO, an 8.8-million-passage dataset widely used for passage-retrieval benchmarks. Setup: 16-core AWS r6i instances, 100 concurrent queries, index sizes up to 1 million documents.

Results: pg_textsearch hit 4.7x the throughput of ParadeDB/Tantivy at equivalent latency (p95 under 100ms). Recall@10 stayed competitive, around 0.85-0.90 across configs. Smaller indexes showed 2-3x gains; at scale, the gap widened due to pg_textsearch’s Postgres-native optimizations, like leveraging just-in-time compilation and shared buffers.

Skeptical take: Self-reported numbers, but they open-sourced the scripts (benchmarks/ in the repo). Reproduce them yourself—MS-MARCO is public, and Docker setups simplify it. Early signs look solid; community feedback post-release will confirm. If it holds, it challenges ParadeDB’s edge, which relies on Rust’s Tantivy for speed but adds extension overhead.

Implications for Postgres Users

Postgres now hosts a complete, scalable hybrid search stack: pgvector + pgvectorscale for embeddings, pg_textsearch for keywords. No need for Lucene derivatives, separate vector DBs like Pinecone, or AGPL compromises. Deployment stays simple: compile the extension and load it into your cluster.
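A minimal sketch of that deployment, assuming pg_textsearch follows standard Postgres extension packaging (the exact extension name and build steps may differ; check the repo):

-- after building and installing the extension against your cluster
-- (e.g. `make install`), enable it per-database:
CREATE EXTENSION pg_textsearch;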

For AI devs: RAG pipelines get cheaper and faster. Index millions of docs, query with late fusion (re-ranking BM25 and cosine-similarity results together), all in SQL. Costs drop versus multi-tool stacks, and Timescale Cloud offers it hosted from day one.
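One common late-fusion recipe is reciprocal rank fusion (RRF), which can be expressed in plain SQL. This is an illustrative sketch, not pg_textsearch’s API: it uses pgvector’s `<=>` cosine-distance operator and core Postgres ranking as the lexical side, and the table, column, and query values are hypothetical. The constant 60 is the conventional RRF damping factor.

WITH semantic AS (
    -- top 50 by vector similarity (pgvector)
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, 0.3]') AS r
    FROM documents
    ORDER BY r
    LIMIT 50
),
lexical AS (
    -- top 50 by keyword relevance
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank_cd(textsearch,
                          websearch_to_tsquery('english', 'vector databases')) DESC
           ) AS r
    FROM documents
    WHERE textsearch @@ websearch_to_tsquery('english', 'vector databases')
    ORDER BY r
    LIMIT 50
)
-- fuse the two rankings: documents found by both lists score highest
SELECT id,
       COALESCE(1.0 / (60 + s.r), 0) + COALESCE(1.0 / (60 + l.r), 0) AS rrf_score
FROM semantic s
FULL JOIN lexical l USING (id)
ORDER BY rrf_score DESC
LIMIT 10;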

Broader impact: AI coding tools compressed what would once have been a 6-12 month build into two quarters. Technical moats erode fast; ParadeDB’s first-mover advantage shrinks. Watch for managed Postgres providers like Supabase or Neon to bundle it, cementing Postgres as a vector-plus-search powerhouse.

Downsides? It’s still maturing and lacks some ParadeDB features, like zonemaps and advanced analyzers, so test your workload. But at v1.0 with benchmark results in its favor, it’s production-ready for many uses. Grab the repo and benchmark it yourself; your next RAG app might run entirely in Postgres.

April 1, 2026 · 4 min · Source: Hacker News
