Emadideen Ghannam

Postgres performance for SaaS at 100 tenants: what actually breaks

The first 100 tenants don't break Postgres. The 101st through 1000th break specific things in a specific order. Here is the order I plan for.

A multi-tenant SaaS on Postgres works fine until it doesn’t. The “doesn’t” comes in stages. Each stage breaks something specific, in a predictable order. After running multi-tenant Postgres for a decade across various platforms, I plan for the order. Skip a stage and you eat downtime later.

This is the playbook.

Stage 0: 1 to 50 tenants

Nothing breaks. A single Postgres instance with row-level security, decent indexes, and a connection pool handles this comfortably. Most teams over-engineer at this stage by adding read replicas or caching layers they don’t need.
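
For concreteness, the row-level security piece can be this small. A sketch, assuming a per-request setting named app.current_tenant (the setting name is mine, not a Postgres built-in) and the attendance table used as the running example later in this post:

```sql
-- Every tenant-owned table carries a tenant_id; attendance is the running example.
CREATE TABLE attendance (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id  bigint NOT NULL,
    status     text,
    created_at timestamptz NOT NULL DEFAULT now()
);

ALTER TABLE attendance ENABLE ROW LEVEL SECURITY;
-- If the app connects as the table owner, also FORCE ROW LEVEL SECURITY.

-- One policy per table: every query is scoped to the tenant the app sets per request.
CREATE POLICY tenant_isolation ON attendance
    USING (tenant_id = current_setting('app.current_tenant')::bigint);

-- In the app, at the start of each request (use SET LOCAL inside a transaction when
-- connections are pooled, so the setting cannot leak into another tenant's request):
--   SET app.current_tenant = '42';
```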

Do this:

  • Single instance. 4-8 cores, 16-32GB RAM, depending on data volume.
  • Connection pool (PgBouncer or built-in app pooler) sized to ~100 connections.
  • Indexes on every tenant_id column.
  • pg_stat_statements enabled from day one.
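
The last two items are a few lines of one-off SQL (a sketch; pg_stat_statements has to be preloaded, which costs one restart):

```sql
-- pg_stat_statements must be preloaded; keep any libraries you already load in the list.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
-- ...restart Postgres, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Plain tenant_id indexes are enough at this stage; composite ones come in Stage 2.
CREATE INDEX IF NOT EXISTS attendance_tenant_id_idx ON attendance (tenant_id);
```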

Don’t do this:

  • Read replicas (you don’t have read pressure yet).
  • Sharding (you don’t have data volume yet).
  • Caching layer (Postgres is faster than your cache invalidation logic).

Stage 1: 50 to 200 tenants - the connection pool stage

The first thing that breaks is connections. Each app instance has a pool of, say, 10 connections to Postgres. If you scale the app to 5 instances, you have 50 connections. Add background workers, you have 70. Add a couple of cron jobs, you have 90. Postgres on a small instance defaults to max_connections = 100. You hit the wall.

The symptom: FATAL: sorry, too many clients already. Sometimes intermittent. Sometimes during a deploy when the new and old instances are both alive.

The fix is PgBouncer. Run it in transaction-pooling mode. Your app’s 90 client connections now terminate at PgBouncer, which multiplexes them onto a pool of, say, 25 real server connections to Postgres. The math works because most app connections are idle most of the time.
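
It is worth confirming that premise before and after the change: most of those connections are sitting idle. A quick look at pg_stat_activity shows the split:

```sql
-- Count client backends by state; in a typical SaaS workload most sit in 'idle'.
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY count(*) DESC;

-- The ceiling you are about to hit without a pooler.
SHOW max_connections;
```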

If you skip PgBouncer at stage 1, you’ll hit the wall around 200 tenants and fix it under pressure. Plan ahead.

Stage 2: 200 to 500 tenants - the slow query stage

The second thing that breaks is query plans. With 200 tenants, your largest tenant has, say, 500,000 rows in attendance. The smallest has 200. Postgres picks query plans based on table-level statistics. The plan that’s fast for the average tenant is slow for the largest.

The symptom: most queries return in 50ms but some return in 5 seconds. Customer support gets a ticket. You dig. The query is fine. The plan is bad.

Two fixes:

Per-tenant statistics. ANALYZE works at the table level, so the planner sees one blended distribution across all tenants rather than the skew of any single tenant. There are workarounds that bias statistics toward the biggest tenants, but in practice they are hard to maintain. The simpler alternative is planner hints (via the pg_hint_plan extension) for the queries that misbehave.
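
If you go the hint route, pg_hint_plan reads hints from a comment at the front of the statement. A sketch; attendance_tenant_created_idx is the composite index described in the next fix:

```sql
-- Requires the pg_hint_plan extension (typically loaded via shared_preload_libraries).
-- Pin the plan that is right for the big tenant even when the planner disagrees.
/*+ IndexScan(a attendance_tenant_created_idx) */
SELECT count(*)
FROM attendance a
WHERE a.tenant_id = 42
  AND a.created_at >= now() - interval '30 days';
```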

More specific indexes. A composite index on (tenant_id, created_at) for time-windowed queries, and a (tenant_id, status, ...) index for filtered lookups. The bigger your largest tenant, the more tenant-aware your indexes need to be.
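
For the attendance example, those indexes look like this (names are illustrative; on a live table build them with CREATE INDEX CONCURRENTLY, which cannot run inside a transaction block):

```sql
-- Time-windowed queries: equality on tenant_id first, then a range scan on created_at.
CREATE INDEX attendance_tenant_created_idx
    ON attendance (tenant_id, created_at);

-- Filtered lookups: tenant first, then the column you filter on.
CREATE INDEX attendance_tenant_status_idx
    ON attendance (tenant_id, status, created_at);
```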

Also at this stage: turn on log_min_duration_statement = 500 so anything over half a second hits the log. Read the log every Friday. Fix the top three each week.
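
Both halves of that habit are a few lines (a sketch; on Postgres 12 and older the pg_stat_statements columns are total_time and mean_time instead):

```sql
-- Log anything slower than half a second. A config reload is enough, no restart.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

-- The Friday list: the slowest statements by average execution time.
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 0) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 3;
```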

Stage 3: 500 to 1500 tenants - the auto-vacuum stage

The third thing that breaks is auto-vacuum. With high write volume across many tenants, the dead-tuple count grows faster than auto-vacuum can clean it up. Tables bloat. Sequential scans get slower. Bloat compounds.

The symptom: queries that were fast last month are slow this month, even though the data hasn’t grown much. pg_stat_user_tables shows dead-tuple counts approaching the live-tuple count.
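
You can see this coming before customers do. The check is one query against pg_stat_user_tables, worth running monthly:

```sql
-- Tables where dead tuples are catching up with live ones.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```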

Three fixes, in order of effort:

  • Tune auto-vacuum. Lower autovacuum_vacuum_scale_factor from the default 0.2 to 0.05 on hot tables so vacuum runs more aggressively (sketched after this list). This helps until your write volume crosses the threshold where auto-vacuum can’t keep up no matter how often it runs.
  • Partition the hot tables by tenant or time. A partitioned attendance table with monthly partitions stays small per partition, vacuums quickly, and keeps each partition’s indexes small. Postgres 13+ has good partition support. Use it.
  • Move from UPDATE-heavy patterns to insert-only. Append-only tables don’t bloat. Convert your “update last_seen” patterns into “insert presence event” patterns. Your data warehouse will thank you anyway.
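
Sketches of the first two fixes, again using attendance; the thresholds and partition boundaries are illustrative, not prescriptions:

```sql
-- Fix 1: per-table auto-vacuum tuning. Vacuum once 5% of rows are dead instead of 20%.
ALTER TABLE attendance SET (
    autovacuum_vacuum_scale_factor  = 0.05,
    autovacuum_analyze_scale_factor = 0.05
);

-- Fix 2: monthly range partitions. Each partition stays small and vacuums quickly.
CREATE TABLE attendance_events (
    tenant_id  bigint NOT NULL,
    status     text,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE attendance_events_2025_01 PARTITION OF attendance_events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE attendance_events_2025_02 PARTITION OF attendance_events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
-- Create next month's partition from a scheduled job, or let pg_partman manage them.
```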

Stage 4: 1500+ tenants - the hard stage

Past 1500 tenants on a single instance, you’re running into hardware limits: CPU saturation, I/O bottlenecks, replication lag, vacuum freeze concerns.

The fixes here are real architecture changes:

  • Read replicas for the read-heavy queries. Route the dashboard queries to replicas. Write traffic stays on the primary.
  • Citus or a sharding layer to distribute tenants across multiple Postgres instances. This is a significant rewrite. Most teams will solve their problem before they get here.
  • Move the noisy neighbours to their own instance. The largest tenant by data volume often dominates the load. Promote them to a dedicated database (the escape hatch out of the shared RLS setup). The remaining 1499 tenants share an instance.

I have not personally taken a system past 1500 tenants on a single Postgres. I have shipped systems that went past 500 and stayed comfortable on the 500-tenant playbook. The 1500+ playbook is documented by teams much larger than the ones I have led; trust them.

What you do not need to do

  • Add Redis as a query cache. Postgres already keeps your hot data in shared buffers and the OS page cache; a second cache mostly buys you cache-invalidation bugs. Redis is for things Postgres is bad at (cross-process pub-sub, leaderboards, session tokens).
  • Switch to “a more performant database.” Postgres at the right tuning is faster than the alternatives at default tuning. The grass is the same.
  • Pre-shard for tenants you don’t have. Premature distribution is worse than premature optimisation. You will pick the wrong shard key.
