Common Backend Mistakes That Cause Scaling Problems
Most backend scaling problems are not infrastructure problems. They are code problems that infrastructure cannot fix. Adding more servers to a backend with an N+1 query problem makes the problem proportionally worse, not better. These are the eight mistakes that most reliably turn a working backend into a bottleneck — and the specific changes that fix them.
Mistake 1: The N+1 Query Problem
The N+1 problem occurs when a query fetches a list of N records and then makes N additional queries to fetch related data for each record. A query that looks correct in development becomes catastrophic in production:
- Example: Fetching 100 orders and then querying the database for each order's customer — 101 queries instead of 2
- Fix in SQLAlchemy: Use joinedload() or selectinload() for eager loading of related records
- Fix for raw queries: Use a JOIN or a single IN query to fetch all related records in one round trip
- Detection: SQLAlchemy's echo=True mode or Django Debug Toolbar reveals query counts per request
- A single endpoint with an N+1 problem can go from 50ms at 10 records to 5,000ms at 1,000 records
Mistake 2: Missing Database Indexes
A query without an index on a table with 1 million rows performs a full table scan — reading every row to find the matching ones. This is the single most common cause of slow database queries in production:
- Every foreign key column must have an index (PostgreSQL does not create them automatically)
- Every column used in a WHERE, ORDER BY, or JOIN clause on a large table needs an index
- Composite indexes (multiple columns) are needed for queries that filter on multiple columns simultaneously
- Use EXPLAIN ANALYZE in PostgreSQL to identify which queries are missing indexes
- The fix is one migration: CREATE INDEX CONCURRENTLY — concurrent creation does not lock the table
Mistake 3: No Connection Pooling
PostgreSQL has a limit on simultaneous connections (typically 100–200). Without connection pooling, each request opens a new connection and closes it on completion. At 50 concurrent users, this saturates the database:
- Use asyncpg's built-in connection pool for FastAPI (min_size=5, max_size=20 for most apps)
- For larger scale, use PgBouncer as a connection pooler in front of PostgreSQL
- Symptom: "FATAL: remaining connection slots are reserved" errors under load
- Connection pool size should be approximately: (number of workers × 2) + 1
Mistake 4: Synchronous Operations in Async Routes
Mixing synchronous and async code in FastAPI routes blocks the event loop and eliminates the concurrency advantage:
- Using synchronous database drivers (psycopg2) in async routes blocks the event loop for the duration of every query
- Using requests library instead of httpx for external API calls blocks the event loop
- Using time.sleep() instead of asyncio.sleep() blocks the event loop
- Fix: audit every import in your routes — psycopg2 → asyncpg, requests → httpx, time.sleep → asyncio.sleep
Mistake 5: No Caching for Repeated Expensive Queries
An endpoint that runs the same expensive query on every request regardless of whether the underlying data changed wastes database resources proportionally to traffic:
- Cache reference data (categories, plans, country lists) in Redis with a 5–60 minute TTL
- Cache user permission sets after authentication — avoid re-querying role/permission tables on every request
- Cache aggregate queries (dashboard counts, totals) with a 30–300 second TTL
- Use cache-aside pattern: check cache → if miss, query DB → store in cache → return result
Mistake 6: Heavy Processing in the Request-Response Cycle
Any operation that takes more than 200ms should not happen synchronously in the HTTP response cycle:
- Email sending: send to a task queue, return 200 immediately
- PDF generation: offload to Celery, return a job ID, poll for completion
- Image processing: use S3 pre-signed URLs for direct upload, process async
- Report generation: async job with webhook or polling — never block the HTTP response
- Webhook delivery: always async — external endpoints may be slow or unavailable
Implementation Checklist
- Run EXPLAIN ANALYZE on your 10 most-called queries — fix missing indexes first
- Connection pooling is configured (asyncpg pool or PgBouncer)
- No synchronous database drivers or HTTP libraries are used in async routes
- N+1 queries are eliminated with eager loading or single batch queries
- Redis caching is in place for reference data and expensive aggregates
- Background tasks handle email, PDF, webhooks, and any operation over 200ms
- Load test your API before launch — identify bottlenecks before real users do
Common Mistakes to Avoid
- ✗Adding more servers before fixing the code — horizontal scaling multiplies N+1 problems proportionally
- ✗Caching everything without understanding cache invalidation — stale cache causes data integrity bugs
- ✗Not measuring before optimising — use profiling (Pyinstrument, cProfile) to find the actual bottleneck
- ✗Over-indexing — too many indexes slow down writes; index only the columns you query on
- ✗Ignoring slow query logs — PostgreSQL's log_min_duration_statement setting surfaces every slow query automatically
Frequently Asked Questions
Need help applying these principles to your project? We build exactly this for startups worldwide.