Infrastructure as Code Best Practices
Why predictable, versioned infrastructure is no longer optional for modern software delivery

We used to manage servers by clicking through web consoles and keeping handwritten notes about firewall rules. It worked, until it did not. A single missed checkbox or a well-intentioned change made under pressure could take down production for hours. Infrastructure as Code changed that. It replaced fragile manual steps with version controlled, testable, and repeatable definitions of the infrastructure your applications depend on.
If you are a developer or a curious engineer who already uses Git and automated tests, you might wonder why you should care deeply about IaC best practices. The short answer is that infrastructure is now part of your application. The long answer is that the way you structure, test, and deploy infrastructure code has a direct impact on reliability, security, and delivery speed. This post draws from real project experience, not just documentation. It covers practical patterns, tradeoffs, and mental models that help teams move faster without stepping on each other's toes.
You can expect a grounded walkthrough of concepts, concrete code and folder examples, honest pros and cons, and a personal look at what works and what hurts. By the end, you will know when IaC is the right fit, which patterns to adopt first, and where it might be overkill.
Context: Where IaC fits today and how teams actually use it
Infrastructure as Code is not a single tool; it is a discipline. You define cloud resources, networking, and deployment pipelines in files that live in version control. You review them, test them, and apply them through automated pipelines. In practice, teams use IaC to stand up environments for development, staging, and production. They use it to create consistent databases, queues, object storage, IAM roles, and Kubernetes clusters. It is common to combine IaC with CI/CD, so changes move through a pipeline that runs plan or diff steps before applying.
The most common categories of IaC tools today are:
- Declarative tools that describe the desired state, such as Terraform for general cloud infrastructure, AWS CloudFormation, Azure Resource Manager templates, and Google Cloud Deployment Manager.
- Configuration management tools that converge a system toward a target state, such as Ansible, Chef, or Puppet, often used for OS level setup and application deployment.
- Programming language based tools like Pulumi or the AWS CDK, which let you define infrastructure using general purpose languages such as TypeScript, Python, or Go.
In real projects, you will often see a blend. Terraform or CloudFormation sets up the cloud foundation. Ansible configures VM images or bootstraps nodes. Kubernetes manifests and Helm charts manage containerized workloads. GitOps practices like Argo CD or Flux apply changes by watching Git repositories, bringing the same version control rigor to Kubernetes resources.
Who uses these tools? Platform engineers, DevOps specialists, SREs, and increasingly application developers who own their services end to end. The choice often hinges on team skills, the cloud provider, and operational maturity. Compared to alternatives, like manual changes or scripts, IaC provides auditability and repeatability. Manual changes scale poorly and lead to drift. Scripts are imperative and can be brittle. Declarative IaC expresses intent, and the tool figures out how to achieve it, which is easier to maintain long term.
Core principles and practical patterns
Keep a single source of truth and embrace version control
Your infrastructure definitions should live in Git. There should be a clear mapping between your code repository and the environments you deploy. Do not store state files or secrets in the repo, but everything else, including modules, configuration, and pipelines, belongs there. A typical monorepo for a platform team might look like this:
infra/
├── .github/
│   └── workflows/
│       └── deploy.yml
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tfvars
│   │   └── outputs.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tfvars
│   │   └── outputs.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tfvars
│       └── outputs.tf
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── ecs/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── rds/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── scripts/
│   ├── plan.sh
│   └── apply.sh
└── README.md
Note that GitHub Actions only discovers workflows under .github/workflows at the repository root, which is why the workflow file sits at the top of the tree rather than inside a pipelines directory.
Commit messages should describe the intent and impact, not just the file names. Enforce branch protections and require reviews for changes to production. If you use trunk based development, combine feature branches with short lived environment branches or feature flags to reduce risk.
Separate configuration from code and manage secrets securely
Hardcoding values in your IaC is a common mistake that leads to fragile code. Configuration such as region, instance sizes, and environment names should be in variable files or a configuration service. Secrets must never be stored in repositories. Use a dedicated secret manager, like AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault, and reference them in your IaC. Terraform can fetch secrets at apply time using data sources, and Pulumi can integrate with providers. For example, a Terraform data source for a secret:
data "aws_secretsmanager_secret_version" "db_creds" {
  secret_id = "prod/db/credentials"
}

resource "aws_db_instance" "main" {
  identifier        = "prod-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = jsondecode(data.aws_secretsmanager_secret_version.db_creds.secret_string)["username"]
  password          = jsondecode(data.aws_secretsmanager_secret_version.db_creds.secret_string)["password"]
}
Keep non secret configuration in version controlled files, but inject environment specific values through tfvars files, environment variables, or a configuration service. Be aware that secret values fetched at apply time still end up in Terraform state, which is one more reason to encrypt state and restrict access to it. In larger teams, a central configuration service can reduce duplication and enforce standards.
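As a minimal illustration of what such a per environment variable file might contain (the names and values here are hypothetical, not a fixed schema):

```hcl
# environments/dev/variables.tfvars -- illustrative values only
env               = "dev"
aws_region        = "us-east-1"
instance_type     = "t3.small"     # smaller instances in dev
db_instance_class = "db.t3.micro"  # cheaper database tier in dev
```

The prod file carries the same variable names with production sized values, so promoting a change is a review of values, not of structure.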
Design reusable modules with clear interfaces
Modules are the building blocks of maintainable IaC. A module should have a small, stable interface and hide implementation details. For example, a VPC module might expose only CIDR blocks, subnets, and NAT gateways, while hiding the details of route tables and security groups.
Terraform example module interface for a VPC:
# modules/vpc/variables.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block"
}

variable "public_subnet_cidrs" {
  type        = list(string)
  description = "Public subnet CIDR blocks"
}

variable "private_subnet_cidrs" {
  type        = list(string)
  description = "Private subnet CIDR blocks"
}
Inside the module, create subnets and routes. Keep outputs minimal and meaningful:
# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
In an environment, you consume the module:
# environments/dev/main.tf
module "vpc" {
  source               = "../../modules/vpc"
  cidr_block           = "10.10.0.0/16"
  public_subnet_cidrs  = ["10.10.1.0/24", "10.10.2.0/24"]
  private_subnet_cidrs = ["10.10.11.0/24", "10.10.12.0/24"]
}
This pattern scales well. As the platform grows, you can version modules and adopt a registry, but start simple with a clear folder structure and consistent variable naming.
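When you do move to versioned modules, pinning each consumer to a compatible release line keeps environments reproducible. A sketch, assuming a hypothetical private registry path:

```hcl
module "vpc" {
  # Hypothetical registry address; local paths like "../../modules/vpc"
  # cannot carry a version, registry sources can.
  source  = "app.terraform.io/my-org/vpc/aws"
  version = "~> 2.1" # accept patch and minor updates within 2.x

  cidr_block           = "10.10.0.0/16"
  public_subnet_cidrs  = ["10.10.1.0/24", "10.10.2.0/24"]
  private_subnet_cidrs = ["10.10.11.0/24", "10.10.12.0/24"]
}
```

With pinned versions, upgrading a module becomes an explicit, reviewable change in each environment rather than a surprise on the next apply.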
Treat infrastructure like software: test, validate, and review
You would not deploy code without tests. Infrastructure deserves the same care. Use a layered testing approach:
- Static validation: Linting and policy checks. For Terraform, terraform fmt and tflint catch formatting issues and common mistakes.
- Plan reviews: Use pull request workflows where diffs are reviewed. Commands like terraform plan, or cloud provider change sets, show exactly what will be created, changed, or destroyed.
- Automated tests: For critical modules, spin up a temporary environment and run checks. Tools like Terratest let you write tests in Go that run terraform apply, then assert properties of the created resources.
A simple Terratest example for an S3 bucket module:
package test

import (
	"strings"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
)

func TestS3BucketCreated(t *testing.T) {
	t.Parallel()

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/s3",
		Vars: map[string]interface{}{
			// S3 bucket names must be lowercase, so normalize the random suffix.
			"bucket_name": "example-bucket-" + strings.ToLower(random.UniqueId()),
		},
	}

	// Always clean up the test resources, even if assertions fail.
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	bucketName := terraform.Output(t, terraformOptions, "bucket_id")
	aws.AssertS3BucketExists(t, "us-east-1", bucketName)
}
This is not about writing tests for every resource. Focus on modules and paths that represent shared building blocks. A test that confirms a database is private and encrypted catches costly misconfigurations early.
Plan for change and avoid destruction by default
Apply changes carefully, especially in production. The plan step is your friend. In CI pipelines, generate a plan during pull requests and gate the apply step behind approvals. For sensitive environments, require two person review or on call approval. Avoid destructive changes by default. If a resource must be replaced, design your code to handle it gracefully, for example by using immutable naming patterns or by supporting blue green deployments.
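One concrete guardrail is Terraform's lifecycle meta argument, which makes any plan that would destroy a protected resource fail outright. A minimal sketch, reusing the prod-db example from earlier:

```hcl
resource "aws_db_instance" "main" {
  identifier     = "prod-db"
  engine         = "postgres"
  instance_class = "db.t3.medium"

  lifecycle {
    # Any plan that would destroy this resource errors out
    # instead of silently scheduling a replacement.
    prevent_destroy = true
  }
}
```

This forces a deliberate, two step process for genuinely intended removals: first delete the guard in one reviewed change, then remove the resource in another.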
Manage state with care
State files store a mapping between your code and real infrastructure. Losing or corrupting state can be disastrous. Store state in a remote backend with locking. Terraform supports remote backends like S3 with DynamoDB for locking. For example, a backend configuration:
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "envs/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
Never commit state files to Git. Use access controls to limit who can read or modify state storage. If you work with multiple teams, consider splitting state by domain to reduce the blast radius of changes.
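Splitting state by domain usually just means giving each domain its own backend key in the same bucket. A sketch along those lines, assuming the same S3 backend setup:

```hcl
# networking/backend.tf -- the networking team's state
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "envs/prod/networking/terraform.tfstate" # per-domain key
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# databases/backend.tf would use key = "envs/prod/databases/terraform.tfstate",
# so a database change can never lock or corrupt networking state.
```

Cross domain references then flow through data sources or published outputs rather than a single shared state file.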
Embrace immutability and drift detection
Immutability means you do not edit resources in place. Instead, you replace them with new versions. This reduces configuration drift and makes changes predictable. For example, when changing an AMI for an EC2 instance, create a new launch template and roll out instances gradually.
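In Terraform, this rollout pattern is often expressed with create_before_destroy, so the replacement resource exists before the old one is removed. A sketch, assuming a variable ami_id supplies the image:

```hcl
resource "aws_launch_template" "app" {
  name_prefix   = "app-"      # fresh names per revision avoid in-place edits
  image_id      = var.ami_id  # changing the AMI produces a new template
  instance_type = "t3.medium"

  lifecycle {
    # Build the replacement before tearing down the old resource,
    # so the autoscaling group is never left without a template.
    create_before_destroy = true
  }
}
```

The autoscaling group then references the new template, and instances are cycled gradually rather than edited in place.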
Detect drift regularly. Cloud provider APIs can report drift, and tools like terraform plan will surface changes outside of IaC. Schedule drift detection runs and investigate discrepancies quickly. Drift is often a sign of manual fixes or shadow IT.
Integrate with CI/CD and adopt GitOps for Kubernetes
A healthy pipeline is the backbone of reliable IaC. Typical stages include:
- Lint and validate
- Plan and generate change summary
- Run automated tests for modules
- Apply to lower environments automatically
- Require approval for production
If you use Kubernetes, GitOps tools like Argo CD or Flux continuously reconcile the cluster state with what is declared in Git. This avoids manual kubectl apply commands and provides an audit trail. Example repository layout for GitOps:
gitops/
├── apps/
│   ├── payment/
│   │   ├── kustomization.yaml
│   │   └── deployment.yaml
│   └── orders/
│       ├── kustomization.yaml
│       └── deployment.yaml
└── infrastructure/
    ├── nginx-ingress/
    │   ├── kustomization.yaml
    │   └── ingress.yaml
    └── cert-manager/
        ├── kustomization.yaml
        └── clusterissuer.yaml
In practice, you keep application manifests separate from cluster add ons. Promotion between environments is done by updating image tags or kustomize overlays, and Argo CD detects changes and syncs them.
Security and compliance by design
Security is not a checkbox. Embed it in your IaC patterns:
- Least privilege: Create IAM roles with narrowly scoped policies. Avoid wildcard actions and resources.
- Encryption at rest and in transit: Enable by default for databases, object storage, and message queues.
- Network segmentation: Use private subnets for databases and internal services, restrict ingress with security groups.
- Secrets management: Reference secrets from a manager, never store them in code.
- Policy as code: Use Open Policy Agent or cloud native policy engines to enforce guardrails. For example, require all S3 buckets to block public access and have encryption enabled.
A minimal OPA rule to require S3 encryption:
package terraform

deny[msg] {
  resource := input.resource.aws_s3_bucket[name]
  not resource.server_side_encryption_configuration
  msg := sprintf("S3 bucket %s must have server side encryption enabled", [name])
}
Run policy checks in CI to block non compliant changes early.
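To make the least privilege point concrete, here is a sketch of an IAM policy scoped to specific read actions on a single bucket instead of a wildcard grant (the bucket name is illustrative):

```hcl
data "aws_iam_policy_document" "app_read_logs" {
  statement {
    # Only the read actions this service actually needs,
    # never "s3:*" on resource "*".
    actions = [
      "s3:GetObject",
      "s3:ListBucket",
    ]
    resources = [
      "arn:aws:s3:::payments-logs-prod",   # bucket-level for ListBucket
      "arn:aws:s3:::payments-logs-prod/*", # object-level for GetObject
    ]
  }
}
```

Attaching this document to a role gives the application exactly the access it needs, and the narrow scope shows up clearly in code review.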
Cost awareness and resource tagging
IaC makes it easy to create resources. It also makes it easy to forget them. Tag resources with environment, owner, and cost center. Use cloud provider budgets and alerts. In your IaC, standardize tagging:
locals {
  common_tags = {
    Environment = var.env
    Project     = "payments"
    Owner       = "platform-team"
    CostCenter  = "cc-1234"
  }
}

resource "aws_s3_bucket" "logs" {
  bucket = "payments-logs-${var.env}"
  tags   = local.common_tags
}
Add automated cleanup for temporary environments. For example, spin up ephemeral preview environments for pull requests and destroy them when the PR closes.
Structuring for scale and collaboration
As teams grow, you will hit questions about repository structure. Two common patterns:
- Monorepo: Single repo for all infrastructure. Pros include shared modules and consistent CI. Cons include larger blast radius and tighter coupling.
- Polyrepo: Separate repos per domain or team. Pros include autonomy and faster iteration. Cons include duplication and coordination overhead.
Start with a monorepo and move to polyrepo when domain boundaries are clear and you need independence. For multi cloud, avoid deep provider coupling in shared modules. Use abstraction layers that hide provider specifics, and keep environment configs separate.
Honest evaluation: strengths, weaknesses, and tradeoffs
Strengths:
- Consistency and repeatability: Environments are created the same way every time.
- Auditability and compliance: Git history shows who changed what and when.
- Speed and scalability: Teams can spin up and tear down environments quickly.
- Early detection of issues: Plan and policy checks catch problems before deployment.
Weaknesses:
- Learning curve: Understanding cloud providers, IAM, networking, and tooling takes time.
- State management complexity: Remote backends, locking, and drift require careful handling.
- Initial setup overhead: Writing modules and pipelines takes effort up front.
- Tool fragmentation: Mixing Terraform, Helm, Ansible, and GitOps tools can be overwhelming.
When IaC might not be the right fit:
- Short lived experiments or one off tasks where manual changes are faster and acceptable.
- Very small projects with minimal infrastructure and low risk.
- Highly regulated environments with strict change controls where you need dedicated governance teams and might rely more on approved templates than ad hoc IaC.
The tradeoff is between initial investment and long term reliability. The best fit is usually teams that deploy frequently, manage multiple environments, or require strict compliance.
Personal experience: Lessons from the trenches
I started using IaC out of necessity. A production database was manually resized, and the change caused downtime because the maintenance window was misaligned. We had no rollback plan and no record of who made the change. That incident led us to adopt Terraform for the entire stack.
One of the biggest lessons was the importance of small, composable modules. In one project, we tried to create a monolithic module that handled VPC, databases, and compute in one place. Changes were risky, and testing was painful. Splitting it into focused modules made reviews easier and reduced errors. We paired each module with a minimal test suite, and the difference in reliability was noticeable.
Another surprise was how often small misconfigurations caused outages. We had a security group rule that allowed 0.0.0.0/0 on SSH temporarily, with a plan to tighten it later. A script hit the server from an unknown IP, and we had to scramble. Now, we default deny and only allow specific IPs. We also added policy checks that block any security group with broad CIDR ranges in production.
State management also bit us. In early days, we kept state files in Git and had a conflict that overwrote the state for a staging environment. Rebuilding was painful. Moving to S3 with DynamoDB locking and versioned state eliminated that risk. We also added nightly drift detection and an alert for any manual changes.
On the developer experience side, IaC felt foreign to application developers at first. The shift from writing functions to defining resources was mental whiplash. We addressed it by pairing platform engineers with developers during reviews, and by providing clear examples and templates. Over time, developers started reading plans like code reviews, and that shared understanding improved the quality of both applications and infrastructure.
One moment stands out. A customer reported slow API responses on a Friday afternoon. We suspected the database, but the evidence was fuzzy. Because we had IaC, we could quickly create a copy of the production environment for debugging, route traffic to it for a subset of users, and compare metrics. Within an hour, we identified a missing index and rolled out a fix. The ability to clone infrastructure on demand saved the weekend.
Getting started: Tooling, workflow, and mental models
You do not need to adopt everything at once. Start with the core workflow and expand as your needs grow.
- Choose your tooling: If you primarily work with AWS, Azure, or GCP, Terraform is a safe bet. If you want to use a general purpose language, try Pulumi or AWS CDK. For configuration management of VMs, Ansible is approachable. For Kubernetes, learn kubectl and kustomize first, then Helm and GitOps.
- Set up a remote backend: Configure storage and locking for state files. This is not optional for teams.
- Define your environments: Start with dev and staging. Keep production simple and locked down until your pipelines are mature.
- Establish CI/CD: Automate lint, plan, test, and apply. Protect production with approvals.
- Create a module library: Build small, reusable modules for common patterns like VPC, databases, and compute. Document inputs, outputs, and examples.
- Add policy and security checks: Integrate static analysis and policy as code early. It is easier to build habits than to retrofit them.
A typical CI workflow for Terraform might look like this in GitHub Actions:
name: Terraform Plan and Apply

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Fmt
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: |
          cd environments/dev
          terraform init

      - name: Terraform Validate
        run: |
          cd environments/dev
          terraform validate

      - name: Terraform Plan
        if: github.event_name == 'pull_request'
        run: |
          cd environments/dev
          terraform plan -var-file=variables.tfvars -out=tfplan

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: |
          cd environments/dev
          terraform apply -auto-approve -var-file=variables.tfvars
In practice, you will add caching, OIDC based cloud authentication, plan comments on PRs, and more. The key is to make the pipeline a helpful reviewer rather than a black box.
Mental models to adopt:
- Treat infrastructure as a product: You have customers, support requests, and release cycles.
- Design for change: Assume you will need to replace or update everything eventually.
- Prefer explicit over implicit: Clear names, clear inputs, clear outputs.
- Make failure visible: Logs, metrics, and alerts should be part of your IaC design.
What makes IaC stand out compared to manual changes or imperative scripts
- Predictability: You see the full plan before applying. Manual changes are blind to side effects.
- Collaboration: Pull requests and reviews spread knowledge and catch errors.
- Repeatability: You can recreate environments on demand. This is crucial for disaster recovery.
- Audit trail: Git history is immutable and searchable. Manual changes leave no reliable record.
Compared to imperative scripts, declarative IaC expresses intent and avoids brittle step by step logic. Compared to platform specific templates, general purpose tools like Terraform and Pulumi offer better portability and developer ergonomics. However, vendor native solutions like CloudFormation may integrate better with specific cloud features or support. Choose based on your constraints and skills.
Free learning resources
- Terraform Documentation: https://www.terraform.io/docs Authoritative guide to syntax, state, backends, and best practices.
- AWS CloudFormation Documentation: https://docs.aws.amazon.com/cloudformation Good for understanding AWS native IaC and stacks.
- Pulumi Documentation: https://www.pulumi.com/docs Learn infrastructure using general purpose languages like TypeScript, Python, and Go.
- Ansible Documentation: https://docs.ansible.com Practical configuration management for servers and automation.
- Open Policy Agent Documentation: https://www.openpolicyagent.org/docs Policy as code for guardrails in CI and deployment workflows.
- Argo CD Documentation: https://argo-cd.readthedocs.io GitOps for Kubernetes with practical deployment patterns.
- Terratest Documentation: https://terratest.gruntwork.io Testing Terraform and infrastructure with real resources.
- CNCF GitOps Principles: https://about.gitops.io High level guidance on GitOps workflows and benefits.
- AWS Best Practices for Tagging: https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html Practical guidance on consistent resource tagging for cost and governance.
- HashiCorp Learn: https://learn.hashicorp.com/terraform Hands on tutorials for core concepts and workflows.
Who should use IaC and who might skip it
Use IaC if:
- You deploy applications regularly and need consistent environments.
- You manage multiple environments or cloud providers.
- Your team values auditability, compliance, and reproducibility.
- You want to enable faster delivery while reducing operational risk.
Consider skipping or delaying IaC if:
- Your infrastructure is tiny, static, and rarely changes.
- You are in a short term experiment or hackathon where speed is paramount.
- You lack the time or support to set up remote state, pipelines, and reviews.
- Your compliance environment forbids automation until specific controls are in place.
A grounded takeaway is that IaC is a force multiplier for teams that care about reliability and speed. It is not a silver bullet, and it requires discipline, but the payoff is substantial. Start small with a single module and a pipeline, grow your library over time, and keep refining your practices as your team and systems evolve. The best infrastructure code is the kind your future self will thank you for reading, reviewing, and running without fear.




