fix(postgres): right-size pools & bound brainztableinator concurrency to fit shared PgBouncer cap#396
Merged
Conversation
… to fit shared PgBouncer cap Under a PgBouncer pooler in session mode, every pooled connection pins a dedicated Postgres backend for its lifetime against a hard per-database cap (45). The sum of service pool maxima was 115 (api 10 + tableinator 50 + brainztableinator 50 + insights 5), so a MusicBrainz bulk import drove brainztableinator into a constant "pool exhausted" retry churn while backends sat idle-in-transaction. Root causes & fixes: - Pool maxima uncoordinated and oversized (brainztableinator's 50 alone > cap). Add resolve_postgres_pool_sizes() with budget-aware, env-overridable defaults (POSTGRES_POOL_MIN_SIZE/MAX_SIZE); new sum of maxima ~36. - brainztableinator had no concurrency bound: prefetch_count=200 x 4 consumers (up to 800 in-flight handlers) each grabbing a connection. Couple prefetch to pool capacity via channel-global QoS (_channel_prefetch). - Per-row child inserts kept the transaction open across N+M round-trips (idle-in-transaction). Batch relationships/external_links via executemany. Adds/updates tests for pool sizing, prefetch coupling, and batched inserts. Adds docs/postgres-pool-exhaustion-analysis.md and updates configuration.md + CLAUDE.md. Does not change POSTGRES_HOST handling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014oJXBmfpaaPShUEDtakJiL
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Contributor
Contributor
Contributor
Contributor
Contributor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
In production, all long-lived services connect to a shared PostgreSQL through a PgBouncer pooler in session-pooling mode (
POSTGRES_HOST=pgbouncer:6432). In session mode every client connection pins a dedicated Postgres backend for its whole lifetime against a hard per-database cap of 45. During a MusicBrainz bulk import,brainztableinatorchurns constantly on⚠️ Connection pool exhausted (attempt 1/5)…while PgBouncer sits at 45/45 with ~18 clients queued and Postgres shows ~29 backends idle in transaction. The app collectively wants ~63 connections against the 45 cap.Full write-up:
docs/postgres-pool-exhaustion-analysis.md.Root causes (with evidence)
maxwas 115 (api 10 + tableinator 50 + brainztableinator 50 + insights 5) — 2.5× the cap.brainztableinator'smax=50alone exceeds the 45 cap. Sizes were copy-pasted "to match prefetch", not budgeted against the shared backend pool.brainztableinator. One transaction per message, withprefetch_count=200× 4 consumers = up to 800 in-flight handlers, each grabbing a pooled connection — permanently driving the pool to its ceiling. (tableinatornever exhausts despite the samemax=50because itsBatchProcessorsemaphore caps concurrent flushes at 2.)INSERTin a Python loop inside one open transaction — N+M sequential round-trips pinning the backend between statements.Fixes (app-side — the app was over-pooling)
resolve_postgres_pool_sizes()incommon/config.py, with per-service defaults (api 2/8, tableinator 2/12, brainztableinator 2/12, insights 1/4) and sharedPOSTGRES_POOL_MIN_SIZE/POSTGRES_POOL_MAX_SIZEoverrides. New sum of maxima ≈ 36 ≤ 45; idle footprint drops from 13 → 8.brainztableinatorprefetch to pool capacity — channel-global QoS (global_=True) withprefetch_count = pool max(_channel_prefetch), so RabbitMQ applies backpressure instead of the pool's retry loop._insert_relationships/_insert_external_linksuse a singleexecutemany, collapsing N+M round-trips to 2 and shrinking the transaction window.The resilient pool's 5-retry "exhausted" path already surfaces a clear hard failure (
common/postgres_resilient.py:543) and is left unchanged — it should now rarely trigger.When to raise the cap instead
Only if 12 concurrent writers prove insufficient after these fixes — then raise the PgBouncer cap and
POSTGRES_POOL_MAX_SIZEtogether, keeping the sum of service maxima under the new cap. The 45 cap was not the limiting factor; the uncoordinated 115 of demand was.Tests & verification
resolve_postgres_pool_sizesunit tests (defaults, env override, invalid values, clamping);_channel_prefetchcoupling tests; batched-insert tests (singleexecutemany, invalid-row filtering, empty-batch no-op).tableinator/brainztableinatormain tests, process-with-relationships/links tests, staletest_batch_performanceplaceholder.2494 passedacross common/brainztableinator/tableinator/api/insights;ruff,ruff format,mypy,banditall green.Does not change
POSTGRES_HOSThandling (host:port via PgBouncer remains the intended deployment).🤖 Generated with Claude Code