fix(postgres): right-size pools & bound brainztableinator concurrency to fit shared PgBouncer cap #396

Merged
SimplicityGuy merged 1 commit into main from worktree-fix+pg-pool-exhaustion
Jun 21, 2026

@SimplicityGuy

Problem

In production, all long-lived services connect to a shared PostgreSQL through a PgBouncer pooler in session-pooling mode (POSTGRES_HOST=pgbouncer:6432). In session mode every client connection pins a dedicated Postgres backend for its whole lifetime against a hard per-database cap of 45. During a MusicBrainz bulk import, brainztableinator churns constantly on ⚠️ Connection pool exhausted (attempt 1/5)… while PgBouncer sits at 45/45 with ~18 clients queued and Postgres shows ~29 backends idle in transaction. The app collectively wants ~63 connections against the 45 cap.

Full write-up: docs/postgres-pool-exhaustion-analysis.md.

Root causes (with evidence)

  1. Uncoordinated, oversized pool maxima. Sum of service pool max was 115 (api 10 + tableinator 50 + brainztableinator 50 + insights 5) — 2.5× the cap. brainztableinator's max=50 alone exceeds the 45 cap. Sizes were copy-pasted "to match prefetch", not budgeted against the shared backend pool.
  2. No concurrency bound in brainztableinator. One transaction per message, with prefetch_count=200 × 4 consumers = up to 800 in-flight handlers, each grabbing a pooled connection — permanently driving the pool to its ceiling. (tableinator never exhausts despite the same max=50 because its BatchProcessor semaphore caps concurrent flushes at 2.)
  3. Per-row child inserts widen the idle-in-transaction window. Each relationship/external-link was a separate INSERT in a Python loop inside one open transaction — N+M sequential round-trips pinning the backend between statements.
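The arithmetic behind root causes 1 and 2 can be checked directly. The snippet below is only a worked calculation using the figures quoted above; the variable names are illustrative and do not come from the codebase.

```python
# Figures quoted in the PR description; names are illustrative only.
PGBOUNCER_CAP = 45
OLD_POOL_MAX = {"api": 10, "tableinator": 50, "brainztableinator": 50, "insights": 5}

# Root cause 1: total demand the services could place on the shared cap.
old_demand = sum(OLD_POOL_MAX.values())
overcommit = old_demand / PGBOUNCER_CAP

# Pools that exceed the shared cap on their own.
over_cap_alone = [name for name, size in OLD_POOL_MAX.items() if size > PGBOUNCER_CAP]

# Root cause 2: handlers that can be in flight at once in brainztableinator.
PREFETCH_COUNT = 200
CONSUMERS = 4
in_flight_handlers = PREFETCH_COUNT * CONSUMERS

print(f"sum of maxima: {old_demand} ({overcommit:.2f}x the cap of {PGBOUNCER_CAP})")
print(f"pools larger than the cap on their own: {over_cap_alone}")
print(f"in-flight handlers competing for connections: {in_flight_handlers}")
```

On paper tableinator's pool was just as oversized as brainztableinator's; as noted above, it avoided exhaustion only because its BatchProcessor semaphore kept concurrent flushes at 2.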

Fixes (app-side — the app was over-pooling)

  • Budget-aware pool sizing via resolve_postgres_pool_sizes() in common/config.py, with per-service defaults (api 2/8, tableinator 2/12, brainztableinator 2/12, insights 1/4) and shared POSTGRES_POOL_MIN_SIZE / POSTGRES_POOL_MAX_SIZE overrides. New sum of maxima ≈ 36 ≤ 45; idle footprint drops from 13 → 8.
  • Couple brainztableinator prefetch to pool capacity — channel-global QoS (global_=True) with prefetch_count = pool max (_channel_prefetch), so RabbitMQ applies backpressure instead of the pool's retry loop.
  • Batch child-row inserts: _insert_relationships / _insert_external_links each use a single executemany, collapsing N+M round-trips to 2 and shrinking the transaction window.
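As a rough illustration of the budget-aware sizing described in the first bullet, the sketch below resolves a (min, max) pair from per-service defaults and the shared environment overrides. It is not the code in common/config.py: the function name resolve_pool_sizes_sketch, the helper, and the exact fallback and clamping rules are assumptions made for the example. Only the default values and the environment variable names come from this description.

```python
import os
from typing import Mapping, Optional, Tuple

# Per-service (min, max) defaults quoted in the PR description.
DEFAULT_POOL_SIZES = {
    "api": (2, 8),
    "tableinator": (2, 12),
    "brainztableinator": (2, 12),
    "insights": (1, 4),
}


def _parse_positive_int(raw: Optional[str], fallback: int) -> int:
    """Return raw as a positive int, or fallback when unset or invalid (assumed rule)."""
    if raw is None:
        return fallback
    try:
        value = int(raw)
    except ValueError:
        return fallback
    return value if value > 0 else fallback


def resolve_pool_sizes_sketch(
    service: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[int, int]:
    """Hypothetical stand-in for resolve_postgres_pool_sizes()."""
    env = os.environ if env is None else env
    default_min, default_max = DEFAULT_POOL_SIZES[service]
    max_size = _parse_positive_int(env.get("POSTGRES_POOL_MAX_SIZE"), default_max)
    min_size = _parse_positive_int(env.get("POSTGRES_POOL_MIN_SIZE"), default_min)
    # Assumed clamping rule: the minimum may never exceed the maximum.
    return min(min_size, max_size), max_size


print(resolve_pool_sizes_sketch("brainztableinator", env={}))
print(resolve_pool_sizes_sketch("api", env={"POSTGRES_POOL_MAX_SIZE": "6"}))
print(sum(max_size for _, max_size in DEFAULT_POOL_SIZES.values()))
```

Passing env explicitly keeps the function easy to unit-test for the cases listed under Tests & verification (defaults, env override, invalid values, clamping) without mutating os.environ.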

The resilient pool's 5-retry "exhausted" path already surfaces a clear hard failure (common/postgres_resilient.py:543) and is left unchanged — it should now rarely trigger.
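To see why the retry path should rarely fire once prefetch is coupled to pool size, the toy simulation below measures how many handlers want a connection at the same moment. It deliberately swaps in an asyncio.Semaphore for RabbitMQ's channel-global prefetch and uses no broker or database, so it only illustrates the bound. The function peak_connection_demand is hypothetical and is not part of the PR.

```python
import asyncio


async def peak_connection_demand(messages: int, in_flight_limit: int) -> int:
    """Return the peak number of handlers wanting a connection at once."""
    limiter = asyncio.Semaphore(in_flight_limit)  # stands in for broker prefetch
    wanting = 0
    peak = 0

    async def handle() -> None:
        nonlocal wanting, peak
        async with limiter:
            wanting += 1
            peak = max(peak, wanting)
            await asyncio.sleep(0)  # pretend to run one transaction
            wanting -= 1

    await asyncio.gather(*(handle() for _ in range(messages)))
    return peak


POOL_MAX = 12  # brainztableinator's new default maximum

# In-flight limit equal to the pool maximum: demand can never exceed the pool.
coupled = asyncio.run(peak_connection_demand(messages=800, in_flight_limit=POOL_MAX))
# In-flight limit far above the pool maximum: the excess would queue or retry.
uncoupled = asyncio.run(peak_connection_demand(messages=800, in_flight_limit=200))

print(f"coupled peak demand: {coupled} (pool max {POOL_MAX})")
print(f"uncoupled peak demand: {uncoupled} (pool max {POOL_MAX})")
```

When the limit equals the pool maximum, every handler that is allowed to start can obtain a connection immediately, so waiting happens at the broker rather than inside the pool's retry loop.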

When to raise the cap instead

Only if 12 concurrent writers prove insufficient after these fixes — then raise the PgBouncer cap and POSTGRES_POOL_MAX_SIZE together, keeping the sum of service maxima under the new cap. The 45 cap was not the limiting factor; the uncoordinated 115 of demand was.

Tests & verification

  • New: resolve_postgres_pool_sizes unit tests (defaults, env override, invalid values, clamping); _channel_prefetch coupling tests; batched-insert tests (single executemany, invalid-row filtering, empty-batch no-op).
  • Updated: pool-size assertions in tableinator/brainztableinator main tests, process-with-relationships/links tests, stale test_batch_performance placeholder.
  • 2494 passed across common/brainztableinator/tableinator/api/insights; ruff, ruff format, mypy, bandit all green.
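The batched-insert behavior these tests cover (single executemany, invalid-row filtering, empty-batch no-op) might look roughly like the sketch below. It is synchronous and uses a mock cursor so it runs without a database. The function name, table and column names, the validity rule, and the %s placeholder style are all assumptions for illustration and are not taken from brainztableinator.

```python
from unittest.mock import MagicMock


def insert_relationships_sketch(cursor, entity_id, relationships):
    """Hypothetical batched insert: one executemany for all valid rows."""
    rows = [
        (entity_id, rel["target_id"], rel["type"])
        for rel in relationships
        if rel.get("target_id") and rel.get("type")  # assumed validity rule
    ]
    if not rows:
        return 0  # empty batch: no statement is sent
    cursor.executemany(
        # Table and column names are made up for the example.
        "INSERT INTO relationships (source_id, target_id, type) VALUES (%s, %s, %s)",
        rows,
    )
    return len(rows)


# One round-trip for the whole batch; the invalid row is dropped first.
cursor = MagicMock()
inserted = insert_relationships_sketch(
    cursor,
    "a1",
    [
        {"target_id": "b2", "type": "member of"},
        {"target_id": None, "type": "member of"},  # filtered out as invalid
        {"target_id": "c3", "type": "collaboration"},
    ],
)
assert inserted == 2
cursor.executemany.assert_called_once()

# An empty batch must not touch the cursor at all.
empty_cursor = MagicMock()
assert insert_relationships_sketch(empty_cursor, "a1", []) == 0
empty_cursor.executemany.assert_not_called()
```

Filtering before the call keeps the transaction to a single statement per child table, which is what shrinks the idle-in-transaction window described under root cause 3.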

Does not change POSTGRES_HOST handling (host:port via PgBouncer remains the intended deployment).

🤖 Generated with Claude Code

… to fit shared PgBouncer cap

Under a PgBouncer pooler in session mode, every pooled connection pins a
dedicated Postgres backend for its lifetime against a hard per-database cap
(45). The sum of service pool maxima was 115 (api 10 + tableinator 50 +
brainztableinator 50 + insights 5), so a MusicBrainz bulk import drove
brainztableinator into a constant "pool exhausted" retry churn while backends
sat idle-in-transaction.

Root causes & fixes:
- Pool maxima uncoordinated and oversized (brainztableinator's 50 alone > cap).
  Add resolve_postgres_pool_sizes() with budget-aware, env-overridable defaults
  (POSTGRES_POOL_MIN_SIZE/MAX_SIZE); new sum of maxima ~36.
- brainztableinator had no concurrency bound: prefetch_count=200 x 4 consumers
  (up to 800 in-flight handlers) each grabbing a connection. Couple prefetch to
  pool capacity via channel-global QoS (_channel_prefetch).
- Per-row child inserts kept the transaction open across N+M round-trips
  (idle-in-transaction). Batch relationships/external_links via executemany.

Adds/updates tests for pool sizing, prefetch coupling, and batched inserts.
Adds docs/postgres-pool-exhaustion-analysis.md and updates configuration.md +
CLAUDE.md. Does not change POSTGRES_HOST handling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014oJXBmfpaaPShUEDtakJiL
@codecov

codecov Bot commented Jun 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

E2E Coverage (webkit)

Totals Coverage
Statements: 46.17% ( 1267 / 2744 )
Lines: 46.17% ( 1267 / 2744 )

@github-actions

E2E Coverage (chromium)

Totals Coverage
Statements: 46.17% ( 1267 / 2744 )
Lines: 46.17% ( 1267 / 2744 )

@github-actions

E2E Coverage (firefox)

Totals Coverage
Statements: 46.17% ( 1267 / 2744 )
Lines: 46.17% ( 1267 / 2744 )

@github-actions

E2E Coverage (webkit - iPhone 15)

Totals Coverage
Statements: 46.17% ( 1267 / 2744 )
Lines: 46.17% ( 1267 / 2744 )

@github-actions

E2E Coverage (webkit - iPad Pro 11)

Totals Coverage
Statements: 46.17% ( 1267 / 2744 )
Lines: 46.17% ( 1267 / 2744 )

@SimplicityGuy SimplicityGuy merged commit 5710f0c into main Jun 21, 2026
57 checks passed
@SimplicityGuy SimplicityGuy deleted the worktree-fix+pg-pool-exhaustion branch June 21, 2026 20:40