Data Modeling for Modern Applications

16 min read · Data and AI · Intermediate

Why how you structure data now affects performance, reliability, and developer velocity

[Diagram placeholder: a data schema showing tables, fields, and relations as boxes and lines, with primary keys and foreign keys labeled]

When I started building APIs in earnest, my data model usually lived in my head and in the ORM annotations on my entity classes. If a feature request came in, I added a field. If a query was slow, I added an index. That worked for prototypes and small internal tools, but the moment you introduce concurrent users, long-lived workflows, and evolving requirements, those choices start to compound. The database becomes a bottleneck, migrations get risky, and the “just one more join” pattern quietly turns a straightforward endpoint into a latency trap.

Modern applications are not just about storing rows and retrieving them. They involve streaming events, caching hot paths, serving real-time dashboards, and powering machine learning features. The data model underpins all of it. Getting it right is less about choosing the perfect schema on day one and more about understanding the tradeoffs you’re making in terms of access patterns, consistency, and scalability. In this article, I’ll share a practical approach to data modeling informed by real projects: evented systems, microservices, analytics pipelines, and mobile apps that need offline sync.

Where modern data modeling fits today

Data modeling is not a single technique or tool. It’s a set of decisions about how to represent the domain, how to guarantee invariants, and how to support the workloads your application will face. In practice, developers use a mix of relational models for transactional integrity, document models for flexible schemas, and event-driven patterns for decoupling services.

In microservice architectures, each service often owns its own data model. That ownership improves autonomy but introduces challenges like eventual consistency, data duplication, and the need for event sourcing or change data capture to keep downstream consumers in sync. In real-time analytics, you often see a separate read model optimized for aggregations, while the write model is optimized for fast ingestion and correctness.

The dominant pattern I see in the field is polyglot persistence: a relational database (like PostgreSQL) for transactional workflows, a document store (like MongoDB) for content-rich entities with flexible attributes, and a streaming platform (like Kafka) for event propagation and materialized views. Picking the right store for each workload and modeling to its strengths beats chasing a single “universal” model.

Core concepts and practical modeling techniques

Entities, value objects, and boundaries

In domain-driven design, entities represent objects with identity, while value objects are immutable and defined by their attributes. In data modeling terms, this often translates to tables with stable primary keys for entities and either embedded JSON or separate tables for value objects. The choice affects how you query and update data.

For example, an Order entity may contain a ShippingAddress value object. If you rarely query by address fields and don’t need them in joins, embedding them as JSON in PostgreSQL can reduce joins and keep the model flexible. If you do need to search by zip code or run reports, a separate addresses table with proper indexes may be better.
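
A minimal Go sketch of that value-object framing (the type and helper names here are illustrative, not part of the example service later in this article): the address has no identity of its own and serializes cleanly into a JSONB column.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShippingAddress is a value object: no identity, defined entirely by its fields.
type ShippingAddress struct {
	Street  string `json:"street"`
	City    string `json:"city"`
	Zip     string `json:"zip"`
	Country string `json:"country"`
}

// toJSONB serializes the value object for storage in a JSONB column,
// keeping the write model free of an extra addresses table.
func toJSONB(a ShippingAddress) ([]byte, error) {
	return json.Marshal(a)
}

func main() {
	b, err := toJSONB(ShippingAddress{Street: "1 Main St", City: "Springfield", Zip: "12345", Country: "US"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The moment you need `WHERE zip = ...` in hot queries or joins, promote the value object to its own table with real columns and indexes.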

Normalization vs. denormalization

Normalization reduces duplication and protects invariants but increases join complexity. Denormalization improves read performance but can make writes more expensive and introduces potential staleness. In practice, you often normalize the write model and denormalize the read model.

A common pattern is to maintain a normalized orders table for transactions and a denormalized order_summary table for dashboards, populated by a background job or an event stream. This keeps critical write paths simple while allowing analytics queries to run fast.
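
The shape of that background job can be sketched in Go without a database. Field names below are assumptions, but the flattening step is the point: country is pulled out of the JSONB address once, so dashboard queries never touch the write model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Order mirrors a normalized write-model row: the address lives in a JSONB blob.
type Order struct {
	ID          string
	Status      string
	TotalCents  int64
	ShippingRaw []byte // contents of the JSONB shipping_address column
}

// SummaryRow is the denormalized read model: fields pre-extracted for dashboards.
type SummaryRow struct {
	OrderID    string
	Status     string
	TotalCents int64
	Country    string
}

// summarize flattens write-model rows into read-model rows, the way a
// background job or event consumer would populate order_summary.
func summarize(orders []Order) ([]SummaryRow, error) {
	rows := make([]SummaryRow, 0, len(orders))
	for _, o := range orders {
		var addr struct {
			Country string `json:"country"`
		}
		if err := json.Unmarshal(o.ShippingRaw, &addr); err != nil {
			return nil, err
		}
		rows = append(rows, SummaryRow{o.ID, o.Status, o.TotalCents, addr.Country})
	}
	return rows, nil
}

func main() {
	rows, err := summarize([]Order{
		{ID: "o1", Status: "paid", TotalCents: 5998, ShippingRaw: []byte(`{"country":"US","zip":"12345"}`)},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(rows[0].Country)
}
```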

Handling time and change

Time is a first-class dimension in modern data models. Event sourcing records changes as a sequence of events rather than overwriting state. This is powerful for auditability, replayability, and building multiple read models from a single source of truth. The tradeoff is complexity: you’ll need snapshotting, compaction, and careful event versioning.
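
A snapshot is just the folded state as of a sequence number; replay then starts from there instead of from event one. A hedged in-memory sketch (event names borrowed from the order example later in this article, everything else illustrative):

```go
package main

import "fmt"

// Event is one recorded change for an aggregate, ordered by sequence number.
type Event struct {
	Seq  int64
	Type string
}

// Snapshot captures state as of a sequence number, so replay can skip
// everything already folded into it.
type Snapshot struct {
	Seq    int64
	Status string
}

// replay applies only the events newer than the snapshot.
func replay(snap Snapshot, events []Event) string {
	status := snap.Status
	for _, e := range events {
		if e.Seq <= snap.Seq {
			continue // already reflected in the snapshot
		}
		switch e.Type {
		case "order_created":
			status = "pending"
		case "payment_captured":
			status = "paid"
		case "order_cancelled":
			status = "cancelled"
		}
	}
	return status
}

func main() {
	snap := Snapshot{Seq: 1, Status: "pending"}
	events := []Event{{1, "order_created"}, {2, "payment_captured"}}
	fmt.Println(replay(snap, events))
}
```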

Change data capture (CDC) is another practical technique, capturing database changes and streaming them to consumers. Tools like Debezium can publish row-level changes into Kafka, enabling downstream services to maintain their own materialized views without direct DB access.

Consistency and partitions

Distributed systems force choices about consistency. Strong consistency in a single database is straightforward; across services or regions, it’s expensive. Eventual consistency is common in microservices. When modeling, design your data to tolerate temporary mismatches and use compensating actions to correct state.

Partitioning and sharding improve throughput but add operational complexity. For time-series data, partitioning by day is effective and keeps indexes smaller. For global apps, consider geo-partitioning to keep data close to users.

Caching and read models

Read-heavy applications benefit from precomputed views and caches. For example, a product catalog may maintain a materialized view of product_availability that combines inventory from multiple services. This view is updated via events and cached at the edge. The data model includes fields chosen specifically for the query shape, not for general-purpose use.

Real-world code context: modeling an order workflow in PostgreSQL with event sourcing

Let’s build a practical example: an order management system where orders transition through states, inventory is reserved, and payments are captured. We’ll model the write path with normalized tables, the read path with a materialized view, and capture state changes with event sourcing.

Project structure

This is a small service with clear boundaries. The structure reflects write models, read models, and events.

order-service/
├── migrations/
│   ├── 001_create_orders.up.sql
│   ├── 002_create_order_events.up.sql
│   ├── 003_create_inventory.up.sql
│   └── 004_create_order_summary_mv.up.sql
├── src/
│   ├── db/
│   │   ├── conn.go
│   │   ├── models.go
│   │   └── migrations.go
│   ├── events/
│   │   ├── producer.go
│   │   └── consumer.go
│   ├── service/
│   │   ├── order_service.go
│   │   └── inventory_service.go
│   └── main.go
├── docker-compose.yml
└── README.md

Database schema: normalized write model

The orders table keeps the core state. The order_events table records state transitions. Inventory is separate, reflecting a bounded context.

-- migrations/001_create_orders.up.sql
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    customer_id UUID NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'fulfilling', 'shipped', 'cancelled')),
    total_amount NUMERIC(12, 2) NOT NULL,
    currency TEXT NOT NULL,
    shipping_address JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_orders_customer_id ON orders (customer_id);
CREATE INDEX idx_orders_status ON orders (status);

-- migrations/002_create_order_events.up.sql
CREATE TABLE order_events (
    id BIGSERIAL PRIMARY KEY,
    order_id UUID NOT NULL REFERENCES orders(id),
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_order_events_order_id ON order_events (order_id);
CREATE INDEX idx_order_events_created_at ON order_events (created_at);

-- migrations/003_create_inventory.up.sql
CREATE TABLE inventory (
    sku TEXT PRIMARY KEY,
    available INTEGER NOT NULL DEFAULT 0,
    reserved INTEGER NOT NULL DEFAULT 0,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_inventory_available ON inventory (available) WHERE available > 0;

Event-driven write path in Go

The service layer enforces invariants and emits events. Here's a simplified flow that reserves inventory and creates an order in the pending state, ready to be marked paid later. We use a transaction to keep consistency within the database, then produce events to Kafka for downstream consumers.

// src/service/order_service.go
package service

import (
	"context"
	"encoding/json"
	"errors"
	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type OrderService struct {
	db       *pgxpool.Pool
	producer EventProducer // abstract interface to Kafka or in-memory for tests
}

type CreateOrderRequest struct {
	CustomerID      string                 `json:"customer_id"`
	Items           []OrderItem            `json:"items"`
	ShippingAddress map[string]interface{} `json:"shipping_address"`
	Currency        string                 `json:"currency"`
}

type OrderItem struct {
	Sku      string `json:"sku"`
	Quantity int    `json:"quantity"`
	Price    string `json:"price"` // decimal as string for precision
}

func (s *OrderService) CreateOrder(ctx context.Context, req CreateOrderRequest) (string, error) {
	tx, err := s.db.Begin(ctx)
	if err != nil {
		return "", err
	}
	defer tx.Rollback(ctx)

	// Compute total and validate inventory
	total, err := computeTotal(req.Items)
	if err != nil {
		return "", err
	}

	for _, item := range req.Items {
		if err := reserveInventory(ctx, tx, item.Sku, item.Quantity); err != nil {
			return "", err
		}
	}

	orderID := genUUID()
	shippingAddrBytes, _ := json.Marshal(req.ShippingAddress)

	_, err = tx.Exec(ctx, `
		INSERT INTO orders (id, customer_id, status, total_amount, currency, shipping_address)
		VALUES ($1, $2, 'pending', $3, $4, $5)
	`, orderID, req.CustomerID, total, req.Currency, string(shippingAddrBytes))
	if err != nil {
		return "", err
	}

	// Record event
	payload, _ := json.Marshal(map[string]interface{}{
		"customer_id": req.CustomerID,
		"items":       req.Items,
		"total":       total,
	})
	_, err = tx.Exec(ctx, `
		INSERT INTO order_events (order_id, event_type, payload)
		VALUES ($1, 'order_created', $2)
	`, orderID, string(payload))
	if err != nil {
		return "", err
	}

	if err := tx.Commit(ctx); err != nil {
		return "", err
	}

	// Emit event for downstream services (inventory, notifications, analytics).
	// Producing after commit is a dual write: if delivery must be guaranteed,
	// publish from the order_events table via an outbox process or CDC instead.
	_ = s.producer.Produce(ctx, "order_events", map[string]interface{}{
		"order_id": orderID,
		"type":     "order_created",
		"payload":  json.RawMessage(payload), // avoid base64-encoding the raw JSON
	})

	return orderID, nil
}

func reserveInventory(ctx context.Context, tx pgx.Tx, sku string, qty int) error {
	res, err := tx.Exec(ctx, `
		UPDATE inventory
		SET reserved = reserved + $1,
		    available = available - $1,
		    updated_at = NOW()
		WHERE sku = $2 AND available >= $1
	`, qty, sku)
	if err != nil {
		return err
	}
	if res.RowsAffected() == 0 {
		return errors.New("insufficient inventory for sku: " + sku)
	}
	return nil
}

func computeTotal(items []OrderItem) (string, error) {
	// In real code, use decimal arithmetic; parsing prices into float64
	// (parseFloat and formatMoney are elided helpers) can lose precision for money.
	total := 0.0
	for _, it := range items {
		p, err := parseFloat(it.Price)
		if err != nil {
			return "", err
		}
		total += p * float64(it.Quantity)
	}
	return formatMoney(total), nil
}

Read model: materialized view for dashboards

Analysts and dashboards rarely need the full normalized model. They want fast aggregations. A materialized view can precompute revenue by day, customer, or region.

-- migrations/004_create_order_summary_mv.up.sql
CREATE MATERIALIZED VIEW order_summary AS
SELECT
    o.id AS order_id,
    o.customer_id,
    o.status,
    o.total_amount,
    o.currency,
    DATE(o.created_at) AS order_date,
    (o.shipping_address->>'country') AS country
FROM orders o;

CREATE INDEX idx_order_summary_date ON order_summary (order_date);
CREATE INDEX idx_order_summary_customer ON order_summary (customer_id);

In production, refresh this view on a schedule or incrementally update it via events. For example, a small Go service consumes order_events and updates a daily_revenue table. This avoids scanning the full orders table for every dashboard load.
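
The core of such a consumer is a fold over events. A sketch, with DailyRevenue standing in for the daily_revenue table (names are mine; real code would upsert rows and track consumer offsets):

```go
package main

import "fmt"

// OrderCreated is the event shape this consumer cares about.
type OrderCreated struct {
	OrderID    string
	Day        string // derived from the event timestamp, e.g. "2024-05-01"
	TotalCents int64
}

// DailyRevenue plays the role of the daily_revenue table: keyed by day and
// updated incrementally as events arrive, instead of rescanned per dashboard load.
type DailyRevenue map[string]int64

// Apply folds a single event into the running totals.
func (d DailyRevenue) Apply(e OrderCreated) {
	d[e.Day] += e.TotalCents
}

func main() {
	rev := DailyRevenue{}
	rev.Apply(OrderCreated{"o1", "2024-05-01", 1999})
	rev.Apply(OrderCreated{"o2", "2024-05-01", 500})
	rev.Apply(OrderCreated{"o3", "2024-05-02", 100})
	fmt.Println(rev["2024-05-01"], rev["2024-05-02"])
}
```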

Event sourcing variant for state transitions

If you need full auditability and the ability to rebuild state, add an event store and build a read model by replaying events. This pattern shines for compliance and debugging.

-- Additional event store for state transitions
CREATE TABLE order_state_events (
    event_id BIGSERIAL PRIMARY KEY,
    order_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    event_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    payload JSONB NOT NULL
);

CREATE INDEX idx_order_state_events_order_id ON order_state_events (order_id);

Rebuilding the current state for an order:

// src/service/rebuild_state.go
package service

import (
	"context"
	"encoding/json"
	"github.com/jackc/pgx/v5/pgxpool"
)

type OrderState struct {
	Status string `json:"status"`
	Total  string `json:"total"`
}

func RebuildOrderState(ctx context.Context, db *pgxpool.Pool, orderID string) (*OrderState, error) {
	rows, err := db.Query(ctx, `
		SELECT event_type, payload
		FROM order_state_events
		WHERE order_id = $1
		ORDER BY event_id
	`, orderID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	state := &OrderState{}
	for rows.Next() {
		var eventType string
		var payload []byte
		if err := rows.Scan(&eventType, &payload); err != nil {
			return nil, err
		}
		switch eventType {
		case "order_created":
			var data map[string]interface{}
			if err := json.Unmarshal(payload, &data); err != nil {
				return nil, err
			}
			state.Status = "pending"
			if t, ok := data["total"].(string); ok {
				state.Total = t
			}
		case "payment_captured":
			state.Status = "paid"
		case "order_cancelled":
			state.Status = "cancelled"
		}
	}
	// Surface any iteration error instead of returning a partial state.
	if err := rows.Err(); err != nil {
		return nil, err
	}
	return state, nil
}

A fun language note: Go’s encoding/json is convenient for event payloads, but be mindful of numeric types. JSON numbers decode to float64 by default, which can lose precision for money. In production, decode to json.Number or use a decimal library.
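
A small demonstration of the difference; decodeTotal is a hypothetical helper, not part of the example service:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// decodeTotal decodes an event payload with UseNumber so the "total" field
// keeps its exact textual form instead of becoming a float64.
func decodeTotal(payload []byte) (json.Number, error) {
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.UseNumber()
	var m map[string]interface{}
	if err := dec.Decode(&m); err != nil {
		return "", err
	}
	n, ok := m["total"].(json.Number)
	if !ok {
		return "", fmt.Errorf("total missing or not a number")
	}
	return n, nil
}

func main() {
	payload := []byte(`{"type":"order_created","total":59.98}`)

	// Default decoding turns every JSON number into float64.
	var loose map[string]interface{}
	_ = json.Unmarshal(payload, &loose)
	fmt.Printf("%T\n", loose["total"]) // float64

	n, err := decodeTotal(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(n.String()) // 59.98
}
```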

Strengths, weaknesses, and tradeoffs

When this modeling approach works well

  • Transactional workloads that need strong consistency within a single database boundary.
  • Systems with clear bounded contexts where services can own their data.
  • Applications requiring auditability, compliance, and the ability to rebuild state from events.
  • Read-heavy dashboards that benefit from precomputed views and targeted indexes.

Where it struggles

  • Cross-service transactions that require distributed locks or two-phase commit; these are brittle and scale poorly.
  • Overusing materialized views without a refresh strategy can lead to stale data.
  • Event sourcing introduces storage growth and replay complexity. Without snapshotting, rebuilding state can be slow.
  • Over-normalization can make simple queries cumbersome and slow; over-denormalization can complicate updates.

Alternative approaches

  • Graph databases excel for relationships-heavy domains like social networks or fraud detection.
  • Columnar stores are better for analytics queries on large time-series datasets.
  • Key-value stores can serve simple lookups and caching layers with minimal latency, but lack relational invariants.
  • NoSQL document stores are helpful when schemas are fluid and queries are document-centric.

In practice, pick the model that fits your primary access patterns. It’s common to blend them: relational for writes, a document store for content delivery, and a stream processor for derived views.

Personal experience and common pitfalls

In one project, we migrated from a single “mega-table” with sparse columns to a normalized schema with JSONB for optional attributes. Queries became clearer, but we initially forgot to index JSON fields used in filters. The symptom was slow search on a dashboard. Adding a GIN index on the JSONB column improved performance by an order of magnitude. Lesson: model for your query shapes and index accordingly.

Another time, we built a materialized view for marketing reports and refreshed it hourly. The report users started complaining about missing today’s data. The fix was to supplement the view with a small “hot cache” table updated by events within seconds, blending batch and real-time. This hybrid approach is often more practical than forcing a single strategy.

I’ve also seen event sourcing used for a billing system where auditability was non-negotiable. Early on, we didn’t version event schemas. When the team changed a payload structure, consumers broke. We solved it by introducing a simple schema registry and event envelope with a version field. It added a bit of overhead but saved many debugging sessions.

Getting started: tooling and workflow

Tooling for relational modeling

  • Postgres is a solid foundation. Use pgx in Go or asyncpg in Python for efficient access.
  • Migrations: golang-migrate or Flyway. Keep migrations small and reversible.
  • Indexing: Use EXPLAIN liberally. Consider partial indexes for common filters and GIN indexes for JSONB.

Workflow for event-driven models

  • Use Kafka or a managed equivalent for events. Start with a single topic per aggregate (e.g., order_events).
  • Consumers should be idempotent. Include event IDs and handle duplicates gracefully.
  • Track offsets and add dead-letter queues for failed messages.
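
Idempotency usually reduces to "have I seen this event ID before?". An in-memory sketch; production code would persist the seen-set (e.g., a processed_events table, an assumed name) transactionally alongside the state it guards:

```go
package main

import "fmt"

// Event carries a unique ID so redelivery can be detected.
type Event struct {
	ID     string
	Amount int64
}

// Consumer tracks processed event IDs: with at-least-once delivery,
// duplicates are expected, not exceptional.
type Consumer struct {
	seen  map[string]bool
	total int64
}

func NewConsumer() *Consumer {
	return &Consumer{seen: make(map[string]bool)}
}

// Handle applies an event exactly once, skipping duplicates.
func (c *Consumer) Handle(e Event) {
	if c.seen[e.ID] {
		return // duplicate delivery: already applied
	}
	c.seen[e.ID] = true
	c.total += e.Amount
}

func main() {
	c := NewConsumer()
	for _, e := range []Event{{"e1", 100}, {"e2", 50}, {"e1", 100}} {
		c.Handle(e)
	}
	fmt.Println(c.total) // 150
}
```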

Local development with Docker

# docker-compose.yml
version: "3.9"
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: ordersdb
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9092:9092"

  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000

volumes:
  pgdata:

Run docker-compose up -d to start local dependencies. Apply migrations with golang-migrate:

migrate -path migrations -database "postgres://dev:dev@localhost:5432/ordersdb?sslmode=disable" up

Project setup notes

  • Keep configuration in environment variables and a .env file for local dev.
  • Structure code around domain boundaries. Avoid leaking database details into handlers.
  • Use interfaces for external dependencies (event producer, DB) to ease testing.
  • Instrument slow queries and event processing latencies with OpenTelemetry or similar.

Distinguishing features and developer experience

What makes this approach stand out is clarity: your write model protects invariants, your read model is optimized for usage, and your event layer captures change without coupling services. Developer experience improves because:

  • Migrations are explicit and versioned, making rollouts predictable.
  • Event-driven updates isolate failures and allow replay.
  • Materialized views reduce query complexity and keep dashboards responsive.

Maintainability also benefits. If you need to introduce a new analytics pipeline, you can consume existing events without changing the core transactional code. If a query pattern changes, you add or adjust a read model rather than rewriting the entire service.

Summary and takeaway

Data modeling for modern applications is about mapping access patterns to storage strengths, balancing consistency with performance, and designing for change. If you’re building transactional services with clear boundaries, a normalized relational write model with event-driven updates and targeted read models will serve you well. If your domain is relationship-heavy or analytics-centric, consider graph or columnar stores to complement the core system.

Who should adopt this approach:

  • Teams building microservices that need transactional integrity and downstream analytics.
  • Systems requiring audit trails and compliance, benefiting from event sourcing and CDC.
  • Applications with mixed read/write patterns, where materialized views and caching reduce load.

Who might skip it:

  • Projects that need cross-service distributed transactions as a primary pattern; consider alternatives that embrace eventual consistency and compensating actions.
  • Very small prototypes where premature optimization and event infrastructure add overhead.
  • Domains where a graph model or columnar warehouse is the primary workload; choose tools that natively fit those shapes.

A grounded takeaway: start by writing down your primary access patterns and invariants. Choose a write model that protects invariants simply. Add a read model that serves the hottest queries. Capture events for change propagation. Iterate. The best data model is the one that supports your current workload while leaving the door open for the next feature, not the one that tries to anticipate every future need on day one.