Infrastructure as Code Best Practices: Beyond the Basics

Infrastructure as Code (IaC) lets teams provision and manage cloud resources through machine-readable definition files rather than manual processes. Most teams have adopted the basics, but many struggle with the advanced patterns and workflows that make infrastructure code maintainable, secure, and efficient.

This guide covers IaC practices that go beyond the basics: code organization, testing, security, deployment, documentation, cost, and team workflows. The examples use Terraform, but the principles apply equally to CloudFormation, Pulumi, and other IaC tools.


Beyond Basic IaC: The Maturity Model

Before diving into specific practices, it’s helpful to understand the IaC maturity model—a framework for assessing and improving your IaC implementation.

The Four Levels of IaC Maturity

Level 1: Manual with Some Automation

  • Infrastructure defined in code but often modified manually
  • Limited version control
  • Ad-hoc testing
  • Minimal documentation

Level 2: Basic IaC Implementation

  • Infrastructure fully defined in code
  • Code stored in version control
  • Manual approval processes
  • Basic documentation
  • Limited testing

Level 3: Advanced IaC Implementation

  • Modular, reusable code
  • Automated testing
  • CI/CD pipeline integration
  • Comprehensive documentation
  • Security scanning

Level 4: Optimized IaC Implementation

  • Self-service infrastructure
  • Comprehensive testing strategy
  • Automated compliance and security
  • Cost optimization
  • Continuous improvement process

This guide focuses on practices that help organizations move from Level 2 to Levels 3 and 4.


Code Organization and Modularity

Well-organized IaC code is easier to maintain, understand, and reuse. Here are best practices for structuring your infrastructure code:

Modular Architecture

Break your infrastructure code into logical, reusable modules:

Terraform Example:

infrastructure/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── database/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   └── compute/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       └── README.md
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
└── global/
    ├── iam/
    │   └── main.tf
    └── dns/
        └── main.tf

Key Principles for Modularity

  1. Single Responsibility: Each module should do one thing well
  2. Encapsulation: Hide internal implementation details
  3. Interface Stability: Maintain stable input/output interfaces
  4. Documentation: Document purpose, usage, and examples
  5. Versioning: Version modules to manage changes
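
As a minimal sketch of what a stable, documented interface might look like, the fragment below gives the networking module typed, validated inputs and a described output. The resource label aws_vpc.this and the validation rules are illustrative assumptions, not taken from an existing module.

```hcl
# modules/networking/variables.tf
variable "environment" {
  description = "Environment name, used in resource names and tags"
  type        = string

  validation {
    # Lowercase letters, digits, and hyphens only (hypothetical naming rule)
    condition     = can(regex("^[a-z][a-z0-9-]*$", var.environment))
    error_message = "The environment value must be lowercase alphanumeric and may contain hyphens."
  }
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string

  validation {
    # cidrhost fails on a malformed CIDR string, and can() turns that failure into false
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr value must be a valid CIDR block."
  }
}

# modules/networking/main.tf
resource "aws_vpc" "this" {
  cidr_block = var.vpc_cidr

  tags = {
    Name = "${var.environment}-vpc"
  }
}

# modules/networking/outputs.tf
output "vpc_id" {
  description = "The ID of the VPC"
  value       = aws_vpc.this.id
}
```

Callers depend only on the variable names and the vpc_id output, so the resources inside the module can change without breaking them.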

Environment Separation

Maintain clear separation between environments while maximizing code reuse:

Terraform Example:

# environments/dev/main.tf
provider "aws" {
  region = "us-west-2"
}

module "networking" {
  source = "../../modules/networking"
  
  environment = "dev"
  vpc_cidr    = "10.0.0.0/16"
  subnet_bits = 8
}

module "database" {
  source = "../../modules/database"
  
  environment     = "dev"
  instance_type   = "db.t3.medium"
  storage_gb      = 50
  subnet_ids      = module.networking.private_subnet_ids
  security_groups = [module.networking.db_security_group_id]
}
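
The module internals are not shown here, so as an assumption, subnet_bits could be the newbits argument to Terraform's built-in cidrsubnet function. The sketch below needs no provider and shows what the dev values would produce; the local names and the offset of 100 for public subnets are hypothetical.

```hcl
variable "vpc_cidr" {
  type    = string
  default = "10.0.0.0/16"
}

variable "subnet_bits" {
  type    = number
  default = 8
}

locals {
  # cidrsubnet(prefix, newbits, netnum): a /16 plus 8 new bits yields /24 blocks.
  # Result: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
  private_subnet_cidrs = [
    for i in range(3) : cidrsubnet(var.vpc_cidr, var.subnet_bits, i)
  ]

  # Offset the public range so it cannot overlap the private one.
  # Result: 10.0.100.0/24, 10.0.101.0/24, 10.0.102.0/24
  public_subnet_cidrs = [
    for i in range(3) : cidrsubnet(var.vpc_cidr, var.subnet_bits, i + 100)
  ]
}

output "private_subnet_cidrs" {
  value = local.private_subnet_cidrs
}

output "public_subnet_cidrs" {
  value = local.public_subnet_cidrs
}
```

Deriving subnets from two inputs keeps each environment's main.tf short: staging or production would change only vpc_cidr.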

Remote State Management

Store state remotely, with locking and encryption, so that teammates share one source of truth and concurrent runs cannot corrupt it:

Terraform Example:

# Configure remote state storage
terraform {
  backend "s3" {
    bucket         = "company-terraform-states"
    key            = "environments/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
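
The backend block assumes that the bucket and lock table already exist. One way to create them, sketched here with AWS provider v4-style resources, is a small bootstrap configuration applied once with local state. The resource labels are illustrative; the bucket and table names match the backend block above, and the S3 backend expects the table's hash key to be a string attribute named LockID.

```hcl
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-states"

  # Guard against accidental deletion of the bucket that holds all state
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning keeps earlier state files available for recovery
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table used by the S3 backend to serialize concurrent runs
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```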

Testing Infrastructure Code

Testing is often overlooked in IaC implementations, but it’s crucial for reliability and confidence.

Types of IaC Tests

  1. Static Analysis: Validate syntax, style, and best practices
  2. Unit Testing: Test individual modules or components
  3. Integration Testing: Test interactions between components
  4. End-to-End Testing: Test complete infrastructure deployments
  5. Security Testing: Scan for security issues and compliance violations

Static Analysis Tools

Terraform Example:

# Run terraform fmt to standardize code style
terraform fmt -recursive

# Initialize without a backend, then check syntax and internal consistency
terraform init -backend=false
terraform validate

# Run tflint for additional linting
tflint

# Run checkov for security and compliance checks
checkov -d .
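
tflint reads its settings from a .tflint.hcl file in the working directory. The following is a sketch of a configuration that enables the bundled Terraform ruleset and a few stricter rules on top of it. Which rules to enable is a team decision, and the selection here is only an example.

```hcl
# .tflint.hcl

# Bundled ruleset for the Terraform language itself
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# Enforce consistent names for resources, variables, and outputs
rule "terraform_naming_convention" {
  enabled = true
}

# Require a description on every variable and output
rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_documented_outputs" {
  enabled = true
}
```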

Unit Testing

Terraform Example with Terratest:

// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/networking",
        Vars: map[string]interface{}{
            "environment": "test",
            "vpc_cidr":    "10.0.0.0/16",
            "subnet_bits": 8,
        },
    })

    // Clean up resources when the test is complete
    defer terraform.Destroy(t, terraformOptions)

    // Deploy the infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Validate the outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId, "VPC ID should not be empty")
    
    subnetCount := terraform.Output(t, terraformOptions, "public_subnet_count")
    assert.Equal(t, "3", subnetCount, "Should create 3 public subnets")
}
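
Terratest creates real resources, which is thorough but slow. For quicker feedback, Terraform 1.6 and later include a native terraform test command, and 1.7 adds provider mocking. The sketch below assumes it is saved as modules/networking/tests/vpc.tftest.hcl and that the module declares a resource named aws_vpc.this; both names are hypothetical.

```hcl
# modules/networking/tests/vpc.tftest.hcl

# Replace the real AWS provider with a mock so no credentials are needed
mock_provider "aws" {}

# Inputs shared by every run block in this file
variables {
  environment = "test"
  vpc_cidr    = "10.0.0.0/16"
  subnet_bits = 8
}

run "vpc_uses_requested_cidr" {
  # Evaluate the plan only; nothing is created
  command = plan

  assert {
    condition     = aws_vpc.this.cidr_block == var.vpc_cidr
    error_message = "The VPC CIDR block does not match the vpc_cidr input."
  }
}
```

Running terraform init followed by terraform test from the module directory would execute the file. Plan-only checks like this complement, rather than replace, a Terratest run against real infrastructure.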

Security and Compliance in IaC

Security should be integrated throughout your IaC workflow, not added as an afterthought.

Secure Coding Practices

  1. Least Privilege: Grant minimal permissions required
  2. Encryption: Encrypt data at rest and in transit
  3. Network Segmentation: Implement proper network boundaries
  4. Secret Management: Never hardcode secrets in IaC files

Terraform Example:

# Bad practice - hardcoded secrets
resource "aws_db_instance" "database" {
  username = "admin"
  password = "supersecretpassword"  # Don't do this!
}

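# Good practice - use secret management
# The resolved value is still written to Terraform state, so keep state encrypted and access-controlled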
resource "aws_db_instance" "database" {
  username = "admin"
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "db/password"
}
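
Least privilege can be expressed in code the same way. As an illustration, the policy below grants read access to the objects in a single bucket instead of a wildcard permission on all of S3. The bucket name and policy name are made up for the example.

```hcl
# Build the policy document from typed blocks instead of a hand-written JSON string
data "aws_iam_policy_document" "app_config_read" {
  statement {
    sid     = "ReadAppConfigObjects"
    effect  = "Allow"
    actions = ["s3:GetObject"]

    # Scope the grant to one bucket's objects (hypothetical bucket name)
    resources = ["arn:aws:s3:::example-app-config/*"]
  }
}

resource "aws_iam_policy" "app_config_read" {
  name   = "app-config-read-only"
  policy = data.aws_iam_policy_document.app_config_read.json
}
```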

Automated Security Scanning

Integrate security scanning into your CI/CD pipeline:

Example GitHub Actions Workflow:

name: 'Infrastructure Security Scan'

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.yaml'
      - '**.json'

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
      
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      
      - name: Terraform Validate
        run: |
          terraform init -backend=false
          terraform validate          
      
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          soft_fail: false
      
      - name: Run checkov
        uses: bridgecrewio/checkov-action@master
        with:
          directory: .
          framework: terraform
          soft_fail: false

Policy as Code

Implement policy as code to enforce security and compliance requirements:

Example with Open Policy Agent (OPA):

# policy/terraform/security.rego
package terraform.security

# Deny S3 buckets without encryption
deny[msg] {
    resource := input.resource.aws_s3_bucket[name]
    not resource.server_side_encryption_configuration
    msg := sprintf("S3 bucket '%v' is missing server-side encryption", [name])
}

# Deny public S3 buckets
public_acls := {"public-read", "public-read-write"}

deny[msg] {
    resource := input.resource.aws_s3_bucket[name]
    public_acls[resource.acl]
    msg := sprintf("S3 bucket '%v' has public access enabled through ACL", [name])
}

# Require VPC flow logs
# Helper rule: true when at least one flow log with a vpc_id is defined.
# Negating a helper rule avoids negating an expression that contains a wildcard.
flow_log_defined {
    input.resource.aws_flow_log[_].vpc_id
}

deny[msg] {
    input.resource.aws_vpc[name]
    not flow_log_defined
    msg := sprintf("VPC '%v' is missing flow logs", [name])
}

Deployment Strategies and CI/CD Integration

Effective deployment strategies are crucial for reliable infrastructure management.

Continuous Integration for IaC

Set up CI pipelines to validate infrastructure code on every change:

Example GitLab CI Configuration:

stages:
  - validate
  - plan
  - apply

variables:
  TF_ROOT: ${CI_PROJECT_DIR}/environments/dev

validate:
  stage: validate
  script:
    - cd ${TF_ROOT}
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check -recursive
    - tflint
    - checkov -d .
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

plan:
  stage: plan
  script:
    - cd ${TF_ROOT}
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - ${TF_ROOT}/tfplan
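  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    # Also plan on main so the apply job has a tfplan artifact to consume
    - if: '$CI_COMMIT_BRANCH == "main"'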

apply:
  stage: apply
  script:
    - cd ${TF_ROOT}
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - plan
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  when: manual

Progressive Deployment Strategies

Implement progressive deployment to reduce risk:

  1. Environment Promotion: Deploy changes through dev → staging → production
  2. Canary Deployments: Deploy changes to a subset of resources first
  3. Blue/Green Deployments: Create new infrastructure before switching traffic
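
One way to approximate a canary or blue/green cutover on AWS is weighted forwarding on an Application Load Balancer listener. This is a sketch under the assumption that the load balancer and both target groups already exist; all variable names are hypothetical.

```hcl
variable "load_balancer_arn" {
  type = string
}

variable "blue_target_group_arn" {
  type = string
}

variable "green_target_group_arn" {
  type = string
}

variable "green_weight" {
  description = "Percentage of traffic sent to the new (green) stack"
  type        = number
  default     = 10

  validation {
    condition     = var.green_weight >= 0 && var.green_weight <= 100
    error_message = "The green_weight value must be between 0 and 100."
  }
}

resource "aws_lb_listener" "web" {
  load_balancer_arn = var.load_balancer_arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "forward"

    forward {
      # Current (blue) stack receives the remainder of the traffic
      target_group {
        arn    = var.blue_target_group_arn
        weight = 100 - var.green_weight
      }

      # New (green) stack receives the canary share
      target_group {
        arn    = var.green_target_group_arn
        weight = var.green_weight
      }
    }
  }
}
```

Raising green_weight from 10 to 100 over several applies shifts traffic gradually, and setting it back to 0 acts as the rollback.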

Rollback Strategies

Plan for failures with effective rollback strategies:

  1. State Backups: Regularly back up state files
  2. Version Pinning: Pin module and provider versions
  3. Incremental Changes: Make small, reversible changes
  4. Automated Rollbacks: Implement automated rollback on failure

Terraform Example:

# Pin provider versions
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16.0"
    }
  }
}

# Pin module versions
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.14.0"
  
  # Module configuration...
}

Documentation and Knowledge Sharing

Documentation is often neglected but is crucial for team collaboration and knowledge transfer.

Self-Documenting Code

Write code that explains itself:

Terraform Example:

# Well-structured, self-documenting code
resource "aws_security_group" "web_server" {
  name        = "${var.environment}-web-server-sg"
  description = "Security group for web servers in ${var.environment} environment"
  vpc_id      = var.vpc_id
  
  # Allow HTTP from anywhere
  ingress {
    description = "HTTP from anywhere"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  # Allow HTTPS from anywhere
  ingress {
    description = "HTTPS from anywhere"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  # Allow SSH only from internal network
  ingress {
    description = "SSH from internal network"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.internal_cidr]
  }
  
  # Allow all outbound traffic
  egress {
    description = "All outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = merge(
    var.common_tags,
    {
      Name = "${var.environment}-web-server-sg"
      Role = "Web Server"
    }
  )
}

Module Documentation

Create comprehensive documentation for reusable modules:

Example README.md for a Module:

# Networking Module

This module creates a complete networking stack including VPC, subnets, route tables, and security groups.

## Usage

```hcl
module "networking" {
  source = "github.com/company/terraform-modules//networking?ref=v1.2.0"

  environment     = "production"
  vpc_cidr        = "10.0.0.0/16"
  azs             = ["us-west-2a", "us-west-2b", "us-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
}
```

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| environment | Environment name (e.g., dev, staging, production) | string | n/a | yes |
| vpc_cidr | CIDR block for the VPC | string | n/a | yes |
| azs | List of availability zones | list(string) | n/a | yes |
| private_subnets | List of private subnet CIDR blocks | list(string) | n/a | yes |
| public_subnets | List of public subnet CIDR blocks | list(string) | n/a | yes |
| enable_nat_gateway | Whether to create NAT Gateways | bool | true | no |
| single_nat_gateway | Whether to use a single NAT Gateway | bool | false | no |

## Outputs

| Name | Description |
|------|-------------|
| vpc_id | The ID of the VPC |
| private_subnet_ids | List of private subnet IDs |
| public_subnet_ids | List of public subnet IDs |
| nat_gateway_ips | List of NAT Gateway IPs |


Cost Optimization and Efficiency

Optimize your infrastructure for cost without sacrificing reliability or performance.

Cost Tagging Strategy

Implement comprehensive tagging for cost allocation:

Terraform Example:

# Define common tags
locals {
  common_tags = {
    Environment = var.environment
    Project     = var.project
    Owner       = var.team
    ManagedBy   = "terraform"
  }
}

# Apply tags to all resources
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
  
  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-web-server"
      Role = "Web"
    }
  )
}
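
Merging the common tags into every resource is easy to forget. As an alternative sketch, the AWS provider's default_tags block applies one tag set to most taggable resources the provider manages. The variable names mirror the ones used above and are declared here only to keep the example self-contained.

```hcl
variable "environment" {
  type = string
}

variable "project" {
  type = string
}

variable "team" {
  type = string
}

provider "aws" {
  region = "us-west-2"

  # Applied automatically; resource-level tags with the same key take precedence
  default_tags {
    tags = {
      Environment = var.environment
      Project     = var.project
      Owner       = var.team
      ManagedBy   = "terraform"
    }
  }
}
```

With this in place, individual resources only need to set the tags that are specific to them, such as Name and Role.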

Resource Optimization

Implement patterns for efficient resource usage:

  1. Right-sizing: Use appropriate instance sizes
  2. Auto-scaling: Scale resources based on demand
  3. Spot Instances: Use spot instances for non-critical workloads
  4. Reserved Instances: Purchase reserved instances for stable workloads
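
Auto-scaling, for instance, can be declared alongside the rest of the infrastructure. The sketch below attaches a target-tracking policy to an existing Auto Scaling group; the variable name and the 60 percent CPU target are assumptions chosen for illustration.

```hcl
variable "autoscaling_group_name" {
  description = "Name of an existing Auto Scaling group"
  type        = string
}

# Add or remove instances to keep average CPU near the target value
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = var.autoscaling_group_name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    target_value = 60
  }
}
```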

Team Workflows and Collaboration

Effective team workflows are essential for collaborative infrastructure management.

GitOps Workflow

Implement GitOps for infrastructure changes:

  1. Infrastructure as Code: All infrastructure defined in code
  2. Git as Single Source of Truth: All changes go through Git
  3. Pull Request Workflow: Changes reviewed before application
  4. Automated Deployment: Changes automatically applied after approval

Collaborative Development Practices

Foster collaboration in infrastructure development:

  1. Code Reviews: Require reviews for all infrastructure changes
  2. Pair Programming: Collaborate on complex infrastructure changes
  3. Knowledge Sharing: Regular tech talks and documentation updates
  4. Cross-Training: Ensure multiple team members understand each component

Conclusion: Building a Culture of Infrastructure Excellence

Implementing advanced IaC practices is as much about culture as it is about technology. To truly excel with Infrastructure as Code:

  1. Invest in Learning: Continuously improve your team’s IaC skills
  2. Embrace Automation: Automate repetitive tasks and validations
  3. Prioritize Quality: Treat infrastructure code with the same rigor as application code
  4. Foster Collaboration: Break down silos between development and operations
  5. Iterate and Improve: Regularly review and refine your IaC practices

By following these best practices, you can transform your Infrastructure as Code implementation from a basic automation tool to a strategic advantage that enables faster innovation, improved reliability, and better cost management.

Remember that IaC maturity is a journey, not a destination. Start where you are, implement improvements incrementally, and continuously evolve your practices as your team and technology landscape change. The investment in advanced IaC practices will pay dividends in the form of more reliable infrastructure, faster deployments, and reduced operational overhead.

Andrew

Andrew is a visionary software engineer and DevOps expert with a proven track record of delivering cutting-edge solutions that drive innovation at Ataiva.com. As a leader on numerous high-profile projects, Andrew brings his exceptional technical expertise and collaborative leadership skills to the table, fostering a culture of agility and excellence within the team. With a passion for architecting scalable systems, automating workflows, and empowering teams, Andrew is a sought-after authority in the field of software development and DevOps.
