Database Indexing Strategies
As data volumes and user expectations grow, smart indexing remains one of the highest-leverage optimizations you can make.

Every engineer eventually hits the same wall: a query that used to run in milliseconds now takes seconds, and the UI starts to feel sluggish. You check the usual suspects: network latency, memory, caching. Often, though, the bottleneck is deeper, in how the database finds the rows it needs. Indexing is the map your database uses to navigate your data. A good map gets you to the destination fast. A poor one sends you wandering through the entire dataset. Getting indexing right is less about memorizing rules and more about understanding how your database accesses data in the real world.
In this article, we will walk through practical indexing strategies, from the fundamentals to advanced patterns that handle heavy write loads and complex analytical queries. We will look at real-world code examples in SQL and a Python application, discuss the tradeoffs that matter in production, and share lessons learned from projects that lived or died by their index design.
The role of indexing in modern applications
Indexing is not just a DBA relic from the 1990s. It sits at the heart of every high-performance application today, whether it’s a social media feed, an e-commerce product catalog, or a financial trading platform. Modern databases, from PostgreSQL to MongoDB and Elasticsearch, all rely on index structures to power fast reads.
At its core, an index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space. The most common index type is the B-tree, which works well for a wide range of queries. Other types include hash indexes for equality lookups, bitmap indexes for low-cardinality columns (common in data warehousing), and specialized structures like GiST or SP-GiST in PostgreSQL for geometric and full-text search.
Who uses these? Backend developers building APIs, data engineers designing ETL pipelines, and product teams analyzing user behavior all depend on indexing. Compared to alternatives like in-memory caching (e.g., Redis), indexing is persistent and works directly on the data model, providing consistency and reducing the need for complex cache invalidation logic.
Core concepts and practical patterns
B-tree indexes: The workhorse
The B-tree is the default index in most relational databases. It keeps data sorted and allows for efficient range queries, sorting, and equality lookups. In PostgreSQL, a B-tree index on a users table’s email column makes lookups by email nearly instantaneous, even with millions of rows.
Consider a typical user table:
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    active BOOLEAN NOT NULL DEFAULT true,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
A simple index on email looks like this:
CREATE INDEX idx_users_email ON users (email);
In practice, this index will be used for queries like:
SELECT * FROM users WHERE email = 'alice@example.com';
Without the index, the database would perform a sequential scan, reading every row. With the index, it navigates the B-tree directly. On large tables, that can be the difference between a sub-millisecond response and a multi-second one.
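You can confirm which path the planner takes with EXPLAIN. The sketch below assumes the users table and idx_users_email from above; the exact plan and timings will vary with table size and statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE email = 'alice@example.com';
-- Without the index (or on a tiny table), expect a Seq Scan node.
-- With idx_users_email in place, expect an Index Scan on idx_users_email.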
Composite indexes: Order matters
When filtering by multiple columns, a composite (multi-column) index can be far more effective than separate single-column indexes. The order of columns in the index is critical: the database can only use a leftmost prefix of the index, so columns used in equality filters should come before range or sort columns, and highly selective columns generally belong early in the list.
For example, an e-commerce orders table might be filtered by customer_id and order_date:
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    amount NUMERIC(10,2) NOT NULL
);
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
A query like this will benefit:
SELECT * FROM orders
WHERE customer_id = 42 AND order_date >= '2023-01-01';
But a query filtering only by order_date will not fully utilize this index, because the index’s leftmost column (customer_id) is not specified. In such cases, a separate index on order_date might be needed.
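If date-only filtering does turn out to be a hot path, a dedicated index is one option. The sketch below uses a hypothetical index name and assumes the orders table above; whether it earns its write cost depends on how often such queries actually run.
-- Hypothetical single-column index for date-only queries
CREATE INDEX idx_orders_date ON orders (order_date);
-- Example query that can now use a range scan on the new index
SELECT count(*) FROM orders WHERE order_date >= '2023-01-01';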
Covering indexes: Avoid the table lookup
A covering index includes all the columns required by a query, allowing the database to answer the query using only the index. This avoids the costly step of looking up rows in the main table.
PostgreSQL supports this with index-only scans. For example:
CREATE INDEX idx_orders_covering ON orders (customer_id, order_date, amount);
Now, this query can be satisfied entirely from the index:
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42;
The database reads the index, finds the matching entries, and returns the data without touching the table. This is especially valuable for reporting queries that read many rows but only a few columns.
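To verify that you actually get an index-only scan, check the plan. The sketch below assumes the orders table and idx_orders_covering above; the alternative index name is hypothetical.
EXPLAIN (ANALYZE)
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42;
-- Look for "Index Only Scan using idx_orders_covering". Heap fetches drop
-- to zero once VACUUM has refreshed the table's visibility map.

-- PostgreSQL 11+ alternative: keep customer_id as the only key column and
-- carry the other columns as non-key payload with INCLUDE.
CREATE INDEX idx_orders_covering_incl ON orders (customer_id)
    INCLUDE (order_date, amount);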
Partial indexes: Index only what you need
Partial indexes index a subset of rows, which can dramatically reduce index size and write overhead. They are perfect for queries that filter on a specific condition.
Suppose you only care about active users:
CREATE INDEX idx_active_users_email ON users (email) WHERE active = true;
Queries that filter for active users will use this small, efficient index. Inactive users are ignored, saving space and maintenance cost.
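For the planner to consider a partial index, the query's WHERE clause must imply the index predicate. A minimal sketch against the users table above:
-- Can use idx_active_users_email: the predicate active = true is implied
SELECT id, email FROM users WHERE active = true AND email = 'alice@example.com';
-- Cannot use the partial index: inactive rows were never indexed
SELECT id, email FROM users WHERE email = 'alice@example.com';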
Functional indexes: Index expressions
Modern databases let you index the result of a function or expression. This is essential for case-insensitive searches or computed values.
For example, to enforce case-insensitive email lookups:
CREATE INDEX idx_users_email_lower ON users (lower(email));
Then queries can use the index naturally:
SELECT * FROM users WHERE lower(email) = lower('Alice@Example.com');
Note that this B-tree index speeds up equality lookups on lower(email); it does not make ILIKE pattern matching fast. For case-insensitive substring search, a trigram index (via the pg_trgm extension) is usually the better fit, as sketched below.
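A sketch with a hypothetical index name, assuming the pg_trgm extension can be installed:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- GIN trigram index on email; supports LIKE/ILIKE with leading wildcards
CREATE INDEX idx_users_email_trgm ON users USING GIN (email gin_trgm_ops);
-- Example query that can use the trigram index
SELECT id, email FROM users WHERE email ILIKE '%alice%';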
Hash indexes: Fast equality, no ranges
Hash indexes are optimized for equality comparisons. They are generally smaller than B-trees but cannot handle range queries. In PostgreSQL, they are less common but can be useful for specific workloads:
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
Use them when you only need exact matches and want to minimize index size.
Indexing JSON and full-text search
Databases like PostgreSQL and MySQL support indexing JSON fields and full-text search, which is critical for modern applications.
For JSON in PostgreSQL, you can create a GIN index on a JSONB column:
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    description TEXT NOT NULL,
    attributes JSONB NOT NULL
);
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);
Queries can then use the index for containment and existence checks:
SELECT * FROM products WHERE attributes @> '{"color": "blue"}';
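If containment (@>) is the only operator you need, the jsonb_path_ops operator class yields a smaller GIN index at the cost of not supporting the key-existence operators (?, ?|, ?&). A sketch with a hypothetical index name:
CREATE INDEX idx_products_attributes_path ON products
    USING GIN (attributes jsonb_path_ops);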
For full-text search, PostgreSQL’s tsvector and tsquery types can be indexed with GIN for fast text search:
CREATE INDEX idx_products_description_fts ON products
USING GIN (to_tsvector('english', description));
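A query can take advantage of this index as long as it repeats the same to_tsvector expression, for example:
SELECT id
FROM products
WHERE to_tsvector('english', description)
      @@ to_tsquery('english', 'wireless & headphones');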
Tradeoffs and evaluation
Strengths
- Performance: Proper indexes can reduce query times by orders of magnitude.
- Flexibility: Modern databases offer a wide range of index types for different data and query patterns.
- Persistence: Indexes survive restarts and don’t require external caching layers.
Weaknesses
- Write overhead: Every index adds cost to INSERT, UPDATE, and DELETE operations. More indexes mean slower writes.
- Storage cost: Indexes consume disk space and memory. On large tables, indexes can rival the table size.
- Maintenance: Indexes can become bloated or go unused over time. Regular analysis is needed, and occasionally a rebuild (see the sketch after this list).
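As noted in the maintenance point above, bloated indexes in PostgreSQL can be rebuilt online; a minimal sketch (REINDEX ... CONCURRENTLY requires PostgreSQL 12+):
-- Rebuild a bloated index without blocking reads and writes
REINDEX INDEX CONCURRENTLY idx_users_email;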
Tradeoffs
- Read vs. Write Balance: In write-heavy systems, fewer, carefully chosen indexes are better. In read-heavy systems, more indexes are acceptable.
- Selectivity: High-cardinality columns (like emails) benefit more from indexes than low-cardinality columns (like boolean flags).
- Query Patterns: Index for your most common and critical queries. Don’t index for every possible filter.
When to avoid indexing
- Small tables: Sequential scans are often faster than index scans for tiny tables.
- Write-heavy tables: Too many indexes can cripple write throughput.
- Infrequently queried data: Index maintenance cost isn’t worth it for rarely accessed data.
Real-world code context: Building a performant API
Let’s look at a real-world scenario: a Python API using PostgreSQL for a user management service. The goal is to ensure fast user lookups and paginated listings.
Project structure
user_service/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── models.py
│   ├── db.py
│   └── queries.py
├── migrations/
│   └── 001_initial.sql
├── requirements.txt
└── Dockerfile
Database setup and migrations
The migrations/001_initial.sql file creates tables and indexes:
-- migrations/001_initial.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    active BOOLEAN NOT NULL DEFAULT true,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);
-- Core index for login
CREATE UNIQUE INDEX idx_users_email ON users (email);
-- Partial index for active user queries
CREATE INDEX idx_active_users_created ON users (created_at) WHERE active = true;
-- Functional index for case-insensitive search
CREATE INDEX idx_users_email_lower ON users (lower(email));
Application code: Queries with indexing
In app/db.py, we set up a small asyncpg connection helper for async I/O (in production you would typically share a pool created with asyncpg.create_pool rather than opening a connection per request):
# app/db.py
import asyncpg
from contextlib import asynccontextmanager

DSN = "postgresql://user:password@localhost:5432/userdb"

@asynccontextmanager
async def get_connection():
    conn = await asyncpg.connect(DSN)
    try:
        yield conn
    finally:
        await conn.close()
In app/queries.py, we write queries that leverage our indexes:
# app/queries.py
from app.db import get_connection

async def get_user_by_email(email: str):
    async with get_connection() as conn:
        # Uses idx_users_email_lower thanks to the lower() expression
        row = await conn.fetchrow(
            "SELECT * FROM users WHERE lower(email) = lower($1)",
            email
        )
        return dict(row) if row else None

async def list_active_users(limit: int = 50, offset: int = 0):
    async with get_connection() as conn:
        # Uses idx_active_users_created for fast filtering and ordering
        rows = await conn.fetch(
            "SELECT id, email, created_at FROM users "
            "WHERE active = true ORDER BY created_at DESC LIMIT $1 OFFSET $2",
            limit, offset
        )
        return [dict(r) for r in rows]
In app/main.py, we expose these via an API:
# app/main.py
from fastapi import FastAPI, HTTPException
from app.queries import get_user_by_email, list_active_users

app = FastAPI()

# Declare the static path before the dynamic one so that
# GET /users/active is not captured by /users/{email}.
@app.get("/users/active")
async def read_active_users(limit: int = 50, offset: int = 0):
    return await list_active_users(limit, offset)

@app.get("/users/{email}")
async def read_user(email: str):
    user = await get_user_by_email(email)
    if not user:
        raise HTTPException(status_code=404, detail="User not found")
    return user
Observability: Checking index usage
In PostgreSQL, you can check if queries use indexes via EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE lower(email) = lower('alice@example.com');
Look for “Index Scan” or “Index Only Scan” in the output. If you see “Seq Scan” on a large table, it’s a sign that your index may not be used due to low selectivity, missing statistics, or function mismatches.
Performance tuning example
Suppose the list_active_users endpoint is slow when offset gets large. This is a classic pagination problem. Offset-based pagination requires the database to scan and discard rows. For better performance, consider keyset (cursor-based) pagination:
# app/queries.py (continued)
async def list_active_users_cursor(last_created_at=None, limit: int = 50):
    async with get_connection() as conn:
        if last_created_at:
            rows = await conn.fetch(
                "SELECT id, email, created_at FROM users "
                "WHERE active = true AND created_at < $1 "
                "ORDER BY created_at DESC LIMIT $2",
                last_created_at, limit
            )
        else:
            rows = await conn.fetch(
                "SELECT id, email, created_at FROM users "
                "WHERE active = true ORDER BY created_at DESC LIMIT $1",
                limit
            )
        return [dict(r) for r in rows]
This approach uses the partial index on created_at (idx_active_users_created) efficiently, avoiding the overhead of large offsets.
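You can sanity-check this directly in SQL: the plan should walk idx_active_users_created backwards (to satisfy the DESC ordering) and stop after the LIMIT, instead of scanning and discarding offset rows. A sketch with an arbitrary cursor value:
EXPLAIN (ANALYZE)
SELECT id, email, created_at
FROM users
WHERE active = true AND created_at < '2023-06-01'
ORDER BY created_at DESC
LIMIT 50;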
Personal experience: Lessons from the trenches
I once worked on a multi-tenant SaaS platform where every customer had their own set of orders. Queries filtered by tenant_id and order_date. Initially, we had a single index on tenant_id and another on order_date. Performance was acceptable for small tenants but degraded for larger ones.
The turning point was adding a composite index: idx_orders_tenant_date (tenant_id, order_date). Suddenly, queries for a specific tenant over a date range became sub-second, even with millions of rows. But we learned the hard way that writes slowed down. Each new order now had to update two indexes. We balanced this by removing a redundant index on order_date alone, since the composite index could handle most queries.
Another lesson came from functional indexes. We had a case-insensitive login query that used LOWER(email) in the WHERE clause but no index. After adding idx_users_email_lower, login times dropped from 300ms to 5ms. However, we forgot to update the application code to use lower(email) consistently. Queries with email = 'Alice@Example.com' still did a sequential scan. This taught me that indexing is a contract between the database and the application. Both sides must honor it.
Finally, we faced a tricky issue with partial indexes. We had a status column with values like active, pending, archived. Most queries looked for status = 'active'. We created a partial index on status for active rows. But a new feature required listing pending orders. The partial index couldn’t help. We had to adjust the index or add a new one. This highlighted the need to revisit indexing strategies as requirements evolve.
Getting started: Workflow and mental models
Indexing is not a one-time task. It’s an iterative process. Here’s a mental model for getting started:
- Identify critical queries: Use your application’s logs or APM tools (like New Relic or Datadog) to find slow queries. Focus on endpoints with high traffic and strict latency requirements.
- Analyze query plans: For each critical query, run EXPLAIN ANALYZE to see how the database executes it. Look for sequential scans on large tables, sorting operations, or nested loops that could be avoided.
- Design indexes: Start with equality filters, then range filters. Consider composite indexes ordered to match your filters (equality columns before range columns, selective columns early). Think about covering indexes to avoid table lookups.
- Measure impact: Create the index in a staging environment. Compare query plans and performance before and after. Use tools like pg_stat_user_indexes in PostgreSQL to monitor index usage.
- Iterate: As your application changes, new queries emerge and old ones fade. Periodically review unused indexes and remove them to reduce write overhead (see the query sketch after this list).
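As referenced in the last step, the query below is one way to list candidate indexes that have never been scanned since statistics were last reset. Treat it as a starting point: check replicas and recent workloads first, and keep indexes that back primary keys or unique constraints.
-- Unused indexes, largest first (PostgreSQL)
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;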
Tooling
- PostgreSQL: Use pg_stat_statements to track query performance. The pgBadger tool can analyze logs for index usage patterns.
- MySQL: The EXPLAIN command and the Performance Schema provide insights. pt-index-usage from Percona Toolkit helps identify unused indexes.
- MongoDB: The explain() method and the mongostat utility show index usage. The Database Profiler logs slow queries.
- General: APM tools (New Relic, Datadog) and database monitoring (Prometheus + Grafana) are invaluable for spotting trends.
For a simple start, you can add a monitoring query to your application’s health check:
# app/main.py (extended)
from app.db import get_connection

@app.get("/health")
async def health_check():
    async with get_connection() as conn:
        # Report how heavily the application's indexes are being scanned
        index_stats = await conn.fetch("""
            SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
            FROM pg_stat_user_indexes
            WHERE indexrelname LIKE 'idx_%'
            ORDER BY idx_scan DESC
            LIMIT 10;
        """)
        return {
            "status": "ok",
            "top_indexes": [dict(s) for s in index_stats]
        }
This gives a quick glimpse into which indexes are actively used.
Free learning resources
- Use The Index, Luke! (https://use-the-index-luke.com/): A fantastic, free resource explaining indexing in relational databases with clear examples. It covers B-trees, composite indexes, and database-specific quirks.
- PostgreSQL Documentation on Indexes (https://www.postgresql.org/docs/current/indexes.html): The official docs are thorough and include details on index types, partial indexes, and performance considerations.
- MongoDB Indexing Strategies (https://www.mongodb.com/docs/manual/applications/indexes/): A practical guide to indexing in MongoDB, with examples for common query patterns.
- Percona MySQL Indexing Guide (https://www.percona.com/mysql-performance-tuning/indexes): A hands-on resource for MySQL, including tools for analyzing index usage.
- SQL Indexing and Tuning Webinar Series by Markus Winand (https://winand.at/SQL-Indexing-Tuning): Free webinars that dive deep into real-world indexing strategies.
These resources are grounded in real database behavior and avoid theoretical fluff.
Conclusion
Database indexing is a powerful tool for improving query performance, but it requires thoughtful design and ongoing maintenance. It shines in read-heavy applications with well-defined query patterns, such as user authentication, product searches, and time-series reporting. For write-heavy systems like event logging or real-time analytics, indexes must be used sparingly to avoid write bottlenecks.
Who should invest heavily in indexing? Backend developers and data engineers building services where response time directly impacts user experience or revenue. Who might skip it? Prototype projects or applications with tiny datasets where simplicity trumps performance.
The takeaway is that indexing is not a set-and-forget feature. It’s a dialogue between your data, your queries, and your infrastructure. Start with your most critical paths, measure relentlessly, and iterate. When done right, indexing transforms sluggish applications into responsive ones, making the complex simple and the slow fast.