The Complete Guide to Production EKS with Terraform
Introduction: Production EKS with Terraform
This Terraform EKS production guide covers everything you need to deploy, secure, and operate Amazon Elastic Kubernetes Service at enterprise scale. EKS has become the dominant Kubernetes platform on AWS, but the gap between a default EKS cluster and a production-ready deployment is substantial. Default configurations lack proper network isolation, autoscaling, security hardening, observability, and disaster recovery. Closing this gap requires deliberate architectural decisions codified in Terraform.
This guide addresses the complete lifecycle of a production EKS cluster: VPC design with proper subnet tagging, cluster configuration with private API endpoints, Karpenter-based autoscaling that responds in seconds, pod security using Pod Security Standards and IRSA, essential addon management, and self-healing patterns that automatically recover from node failures. Every configuration is implemented as Terraform code, ensuring repeatability across environments and full auditability of infrastructure changes.
The patterns presented here are extracted from production EKS deployments managed by Citadel Cloud Management, running workloads ranging from microservices platforms to data-intensive batch processing systems. The complete Terraform modules are available in our terraform-aws-eks repository, with auto-healing extensions in terraform-aws-auto-healing-eks.
The AWS EKS Best Practices Guide provides official recommendations that complement the practices described here. We reference it throughout this guide where relevant.
VPC Foundation for EKS
Every production EKS cluster begins with a properly designed VPC. The VPC architecture determines network isolation, IP address availability, load balancer placement, and cross-AZ resilience. Getting the VPC wrong creates problems that are expensive to fix after the cluster is running.
Subnet Architecture
A production EKS VPC requires three subnet tiers across at least three Availability Zones: public subnets for internet-facing load balancers (ALB, NLB), private subnets for worker nodes and pods, and isolated subnets (optional) for databases and sensitive workloads with no internet access. Each subnet tier must span all three AZs for high availability.
CIDR planning is critical. EKS worker nodes and pods consume IP addresses from the VPC CIDR. A common mistake is using a /24 or /22 CIDR that exhausts IPs as the cluster scales. For production, use at minimum a /16 VPC CIDR (65,536 IPs). With the VPC CNI's custom networking feature, you can assign pods to secondary CIDR ranges, effectively decoupling pod IP space from node IP space. Our terraform-aws-vpc-complete module implements this pattern with proper subnet tagging for EKS auto-discovery.
EKS Subnet Tags
EKS requires specific tags on subnets to discover them for node placement and load balancer provisioning. Public subnets must be tagged with kubernetes.io/role/elb = 1 for internet-facing load balancers. Private subnets must be tagged with kubernetes.io/role/internal-elb = 1 for internal load balancers. All subnets used by the cluster must be tagged with kubernetes.io/cluster/CLUSTER_NAME = shared (or owned if the subnet is exclusively for that cluster).
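In Terraform, these tags are applied when the subnets are created. The sketch below assumes a `cluster_name` variable and a `/16` VPC; CIDR math and AZ selection are illustrative, not prescriptive:

```hcl
variable "cluster_name" { type = string }

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

data "aws_availability_zones" "available" {}

# Public subnets: discovered by EKS for internet-facing load balancers
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    "kubernetes.io/role/elb"                    = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

# Private subnets: worker nodes and internal load balancers
resource "aws_subnet" "private" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index + 3)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    "kubernetes.io/role/internal-elb"           = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    # Optional: Karpenter subnet discovery (matches the EC2NodeClass selector)
    "karpenter.sh/discovery"                    = var.cluster_name
  }
}
```

The `karpenter.sh/discovery` tag is only needed on subnets where Karpenter should place nodes; load balancer discovery relies solely on the `kubernetes.io/role/*` tags.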
EKS Cluster Configuration
The EKS cluster itself requires careful configuration for production use. Key decisions include the Kubernetes version, API endpoint access, encryption, logging, and authentication mode.
API Endpoint Access
Production clusters should disable the public API endpoint and enable only the private endpoint. This ensures that the Kubernetes API is accessible only from within the VPC or through a VPN/Direct Connect connection. If you must keep the public endpoint enabled (for CI/CD pipelines running outside the VPC), restrict it to specific CIDR blocks using the public_access_cidrs parameter.
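With the community terraform-aws-modules/eks module, the equivalent settings look like this (the CIDR block is a placeholder for your CI/CD egress range, not a real address):

```hcl
# Preferred production posture: private endpoint only.
cluster_endpoint_public_access  = false
cluster_endpoint_private_access = true

# If the public endpoint must stay on for external CI/CD, restrict it:
# cluster_endpoint_public_access       = true
# cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"] # CI/CD egress (example)
```

The module's `cluster_endpoint_public_access_cidrs` input maps to the `public_access_cidrs` parameter on the underlying `aws_eks_cluster` resource.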
Envelope Encryption
EKS supports envelope encryption of Kubernetes secrets using AWS KMS. This adds a layer of encryption beyond the default etcd encryption at rest. With envelope encryption enabled, Kubernetes secrets are encrypted with a data encryption key (DEK), and the DEK itself is encrypted with a KMS customer-managed key (CMK). This is a compliance requirement for most regulated workloads.
Control Plane Logging
Enable all five EKS control plane log types for production: API server, audit, authenticator, controller manager, and scheduler. These logs are sent to CloudWatch Logs and are essential for security investigation, debugging, and compliance auditing. The audit log is particularly critical for tracking who did what in the cluster.
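Control plane logs land in the `/aws/eks/<cluster_name>/cluster` log group. With the community EKS module, the log types and CloudWatch retention can be configured together; retention here is an assumption to tune against your compliance requirements:

```hcl
cluster_enabled_log_types = [
  "api", "audit", "authenticator", "controllerManager", "scheduler"
]

# Avoid unbounded CloudWatch costs; 90 days is a common starting point.
cloudwatch_log_group_retention_in_days = 90
```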
Karpenter Autoscaling for EKS
Karpenter has replaced Cluster Autoscaler as the recommended node autoscaler for EKS. Unlike Cluster Autoscaler, which operates at the Auto Scaling Group level and is constrained by pre-defined node group configurations, Karpenter provisions nodes directly through the EC2 Fleet API, selecting the optimal instance type, size, and purchase option for each pending pod in real time.
How Karpenter Works
Karpenter watches for unschedulable pods (pods in Pending state due to insufficient resources) and responds by launching new EC2 instances that satisfy the pod requirements. It considers CPU, memory, GPU, storage, availability zone, architecture (AMD64/ARM64), and topology spread constraints when selecting instances. Karpenter can launch a precisely-sized instance in under 60 seconds, compared to minutes with Cluster Autoscaler. As documented in the Karpenter documentation, it also handles node consolidation, deprovisioning empty or underutilized nodes to optimize cost.
NodePool and EC2NodeClass
Karpenter uses two primary custom resources: NodePool defines the constraints for node provisioning (instance families, sizes, AZs, taints, labels, and limits), while EC2NodeClass specifies the AWS-specific configuration (AMI family, subnet selector, security group selector, instance profile, and block device mappings). Multiple NodePools can coexist in a cluster, each targeting different workload profiles.
Spot Instance Integration
Karpenter's Spot instance support is a major cost optimization lever. By specifying capacity-type: ["spot", "on-demand"] in the NodePool, Karpenter automatically uses Spot instances when available and falls back to On-Demand when Spot capacity is unavailable. Karpenter handles Spot interruption notices by cordoning and draining affected nodes before termination, ensuring workload availability.
Terraform EKS Module with Karpenter and Node Groups
The following Terraform configuration deploys a production-grade EKS cluster with Karpenter autoscaling, managed node groups for system workloads, essential addons, and security hardening. This module integrates the VPC, cluster, and autoscaling layers into a cohesive deployment.
```hcl
# Production EKS Cluster with Karpenter Autoscaling
# Repository: https://github.com/kogunlowo123/terraform-aws-eks

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12"
    }
    kubectl = {
      source  = "alekc/kubectl"
      version = "~> 2.0"
    }
  }
}

locals {
  cluster_name = "${var.environment}-eks-${var.region_short}"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
    Team        = "platform-engineering"
  }
}

# --- KMS Key for Envelope Encryption ---
resource "aws_kms_key" "eks_secrets" {
  description             = "KMS key for EKS secrets encryption"
  deletion_window_in_days = 14
  enable_key_rotation     = true
  tags                    = local.tags
}

# --- EKS Cluster ---
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = local.cluster_name
  cluster_version = "1.30"

  # Network configuration
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  # API endpoint: private only for production
  cluster_endpoint_public_access  = false
  cluster_endpoint_private_access = true

  # Encryption
  cluster_encryption_config = {
    provider_key_arn = aws_kms_key.eks_secrets.arn
    resources        = ["secrets"]
  }

  # Logging
  cluster_enabled_log_types = [
    "api", "audit", "authenticator",
    "controllerManager", "scheduler"
  ]

  # Authentication
  authentication_mode                      = "API_AND_CONFIG_MAP"
  enable_cluster_creator_admin_permissions = true

  # --- Managed Node Group: System Components ---
  eks_managed_node_groups = {
    system = {
      name           = "system-ng"
      instance_types = ["m6i.large", "m7i.large"]
      capacity_type  = "ON_DEMAND"
      min_size       = 2
      max_size       = 4
      desired_size   = 2

      # Taint for system workloads only
      taints = {
        dedicated = {
          key    = "CriticalAddonsOnly"
          effect = "NO_SCHEDULE"
        }
      }

      labels = {
        role        = "system"
        "node-type" = "system"
      }

      # Use latest EKS-optimized AMI
      ami_type = "AL2023_x86_64_STANDARD"

      # Block device configuration
      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 100
            volume_type           = "gp3"
            iops                  = 3000
            throughput            = 150
            encrypted             = true
            kms_key_id            = aws_kms_key.eks_secrets.arn
            delete_on_termination = true
          }
        }
      }

      update_config = {
        max_unavailable_percentage = 33
      }
    }
  }

  # --- EKS Managed Addons ---
  cluster_addons = {
    vpc-cni = {
      most_recent    = true
      before_compute = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
    coredns = {
      most_recent = true
      configuration_values = jsonencode({
        replicaCount = 2
        # Tolerate the system node group taint so DNS runs on dedicated nodes
        tolerations = [{
          key      = "CriticalAddonsOnly"
          operator = "Exists"
          effect   = "NoSchedule"
        }]
      })
    }
    kube-proxy = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
    }
    adot = {
      most_recent = true
    }
  }

  tags = local.tags
}

# --- IRSA for EBS CSI Driver ---
module "ebs_csi_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name             = "${local.cluster_name}-ebs-csi"
  attach_ebs_csi_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }

  tags = local.tags
}

# --- Karpenter ---
module "karpenter" {
  source  = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> 20.0"

  cluster_name = module.eks.cluster_name

  enable_v1_permissions           = true
  enable_pod_identity             = true
  create_pod_identity_association = true

  node_iam_role_additional_policies = {
    AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  }

  tags = local.tags
}

resource "helm_release" "karpenter" {
  namespace        = "kube-system"
  name             = "karpenter"
  repository       = "oci://public.ecr.aws/karpenter"
  chart            = "karpenter"
  version          = "1.0.0"
  wait             = false
  create_namespace = false

  values = [
    yamlencode({
      settings = {
        clusterName       = module.eks.cluster_name
        clusterEndpoint   = module.eks.cluster_endpoint
        interruptionQueue = module.karpenter.queue_name
      }
    })
  ]

  depends_on = [module.karpenter]
}

# --- Karpenter EC2NodeClass and NodePool for General Workloads ---
resource "kubectl_manifest" "karpenter_node_class" {
  yaml_body = yamlencode({
    apiVersion = "karpenter.k8s.aws/v1"
    kind       = "EC2NodeClass"
    metadata = {
      name = "default"
    }
    spec = {
      role = module.karpenter.node_iam_role_name
      amiSelectorTerms = [{
        alias = "al2023@latest"
      }]
      subnetSelectorTerms = [{
        tags = {
          "karpenter.sh/discovery" = module.eks.cluster_name
        }
      }]
      securityGroupSelectorTerms = [{
        tags = {
          "karpenter.sh/discovery" = module.eks.cluster_name
        }
      }]
      blockDeviceMappings = [{
        deviceName = "/dev/xvda"
        ebs = {
          volumeSize          = "100Gi"
          volumeType          = "gp3"
          iops                = 3000
          throughput          = 150
          encrypted           = true
          deleteOnTermination = true
        }
      }]
    }
  })

  depends_on = [helm_release.karpenter]
}

resource "kubectl_manifest" "karpenter_node_pool" {
  yaml_body = yamlencode({
    apiVersion = "karpenter.sh/v1"
    kind       = "NodePool"
    metadata = {
      name = "general-purpose"
    }
    spec = {
      template = {
        metadata = {
          labels = {
            "node-type" = "general"
          }
        }
        spec = {
          requirements = [
            {
              key      = "kubernetes.io/arch"
              operator = "In"
              values   = ["amd64"]
            },
            {
              key      = "karpenter.sh/capacity-type"
              operator = "In"
              values   = ["spot", "on-demand"]
            },
            {
              key      = "karpenter.k8s.aws/instance-category"
              operator = "In"
              values   = ["m", "c", "r"]
            },
            {
              # Karpenter supports Gt/Lt, not Gte: Gt 5 selects generation 6+
              key      = "karpenter.k8s.aws/instance-generation"
              operator = "Gt"
              values   = ["5"]
            }
          ]
          nodeClassRef = {
            group = "karpenter.k8s.aws"
            kind  = "EC2NodeClass"
            name  = "default"
          }
          expireAfter = "720h" # 30 days max node age
        }
      }
      limits = {
        cpu    = "200"
        memory = "800Gi"
      }
      disruption = {
        consolidationPolicy = "WhenEmptyOrUnderutilized"
        consolidateAfter    = "1m"
      }
    }
  })

  depends_on = [kubectl_manifest.karpenter_node_class]
}

# --- Variables ---
variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
}

variable "region_short" {
  type        = string
  description = "Short region identifier (e.g., use1)"
}

variable "vpc_id" {
  type        = string
  description = "VPC ID for the EKS cluster"
}

variable "private_subnet_ids" {
  type        = list(string)
  description = "Private subnet IDs for EKS worker nodes"
}

# --- Outputs ---
output "cluster_name" {
  value = module.eks.cluster_name
}

output "cluster_endpoint" {
  value = module.eks.cluster_endpoint
}

output "cluster_certificate_authority" {
  value = module.eks.cluster_certificate_authority_data
}
```
This module provisions a complete production EKS environment: a KMS key for secrets encryption, an EKS cluster with a private API endpoint, a managed node group for system components with dedicated taints, all essential EKS addons with IRSA where required, and Karpenter for autoscaling general-purpose workloads with Spot instance support. The Karpenter NodePool is restricted to modern instance families (generation 6 and newer) on AMD64 architecture and automatically consolidates underutilized nodes.
Essential EKS Addons for Production
EKS addons are operational software that provide essential cluster functionality. Managing them through EKS managed addons (rather than self-managed Helm charts) ensures compatibility with the cluster version and simplifies upgrades.
Networking: VPC CNI
The Amazon VPC CNI plugin assigns VPC IP addresses directly to pods, enabling native VPC networking without overlay networks. For production, enable prefix delegation (ENABLE_PREFIX_DELEGATION=true) to increase the number of pods per node from approximately 29 (on m5.large) to 110+. This dramatically improves node utilization and reduces the number of nodes needed. The VPC CNI must be deployed before compute (nodes) to ensure proper networking initialization.
DNS: CoreDNS
CoreDNS handles cluster DNS resolution. For production, deploy at least two replicas across different nodes for high availability. Consider running CoreDNS on the system node group with the CriticalAddonsOnly toleration to ensure DNS remains available even during scaling events. Monitor CoreDNS latency and cache hit rates as they directly impact application performance.
Storage: EBS CSI Driver
The EBS CSI driver enables dynamic provisioning of EBS volumes for persistent workloads. It must run with an IRSA role that has permissions to create, attach, and delete EBS volumes. Create StorageClasses for different workload profiles: gp3 for general purpose, io2 for high-performance databases, and sc1 for archival storage. Enable volume encryption by default in the StorageClass.
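A default StorageClass for general-purpose workloads might look like the following sketch; the name is arbitrary, and omitting `kmsKeyId` means EBS uses the account's default encryption key:

```yaml
# Default gp3 StorageClass with encryption on by default
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # delay binding until pod is scheduled
allowVolumeExpansion: true
parameters:
  type: gp3
  encrypted: "true"
```

`WaitForFirstConsumer` matters in multi-AZ clusters: it ensures the volume is created in the same AZ as the pod that will mount it.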
Observability: ADOT
The AWS Distro for OpenTelemetry (ADOT) provides a vendor-neutral observability pipeline for metrics, traces, and logs. ADOT collectors can send telemetry to CloudWatch, X-Ray, Prometheus, Grafana, and other backends. For production, deploy ADOT as a DaemonSet for infrastructure metrics and as a sidecar for application-level tracing.
Pod Security and IRSA
Securing pods in production EKS requires multiple layers: Pod Security Standards for restricting container capabilities, IRSA for granting AWS permissions to specific pods, network policies for controlling pod-to-pod traffic, and secrets management for protecting sensitive configuration.
Pod Security Standards
Kubernetes Pod Security Standards (PSS) replace the deprecated Pod Security Policies (PSP). PSS defines three profiles: Privileged (unrestricted, for system components only), Baseline (prevents known privilege escalations), and Restricted (hardened, for all application workloads). In production, enforce the Restricted profile on all application namespaces and the Baseline profile on system namespaces. Apply these using namespace labels:
pod-security.kubernetes.io/enforce: restricted and pod-security.kubernetes.io/warn: restricted on application namespaces. This prevents containers from running as root, using host networking, mounting host paths, or escalating privileges.
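A namespace manifest carrying these labels looks like this (the namespace name is an example):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments   # example application namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The `warn` and `audit` modes surface violations without blocking, which is useful when rolling Restricted out to existing workloads before switching `enforce` on.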
IAM Roles for Service Accounts (IRSA)
IRSA is the only acceptable method for granting AWS permissions to pods in production. It uses the EKS OIDC provider to create a trust relationship between Kubernetes service accounts and IAM roles. Each pod receives temporary AWS credentials scoped to its specific IAM role, with no shared credentials or long-lived keys. The Terraform module above demonstrates IRSA configuration for the EBS CSI driver; the same pattern applies to every application that needs AWS API access.
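The application-level pattern is a sketch of the EBS CSI configuration above with different names; the service, namespace, and policy below are hypothetical:

```hcl
# Hypothetical: IRSA for an "orders-api" service that reads from S3.
module "orders_api_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "${local.cluster_name}-orders-api"

  # Least-privilege policy for this workload (definition not shown)
  role_policy_arns = {
    s3_read = aws_iam_policy.orders_s3_read.arn
  }

  oidc_providers = {
    main = {
      provider_arn = module.eks.oidc_provider_arn
      # Trust is scoped to exactly one service account in one namespace
      namespace_service_accounts = ["orders:orders-api"]
    }
  }
}
```

The Kubernetes service account then references the role via the `eks.amazonaws.com/role-arn` annotation, and pods using that service account receive temporary credentials automatically.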
Network Policies
The VPC CNI supports Kubernetes Network Policies natively as of v1.14, provided the network policy feature is enabled in the addon configuration. Define default-deny ingress and egress policies on all namespaces, then explicitly allow required traffic flows. This implements microsegmentation at the pod level, preventing lateral movement if a pod is compromised. For more advanced policies (L7 filtering, DNS-based rules), deploy Calico or Cilium as a supplementary network policy engine.
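A minimal default-deny plus DNS-allow pair, using a placeholder namespace name:

```yaml
# Deny all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments   # example namespace
spec:
  podSelector: {}       # empty selector matches all pods
  policyTypes:
    - Ingress
    - Egress
---
# Re-allow DNS egress to CoreDNS, or name resolution breaks everywhere
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

From this baseline, add one allow policy per required traffic flow rather than widening the defaults.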
EKS vs AKS vs GKE Feature Comparison
Choosing between managed Kubernetes services is a critical architectural decision. The following comparison captures the current state of EKS, AKS, and GKE across production-relevant dimensions.
| Feature | EKS (AWS) | AKS (Azure) | GKE (GCP) |
|---|---|---|---|
| Control Plane Cost | $0.10/hr ($73/mo) | Free (Free tier), $0.10/hr (Standard tier with uptime SLA) | $0.10/hr (Standard and Autopilot; one zonal cluster free) |
| Node Autoscaling | Karpenter, Cluster Autoscaler | Node Autoprovisioning (Karpenter-based), Cluster Autoscaler | Node Auto-Provisioning, Cluster Autoscaler |
| Serverless Nodes | Fargate Profiles | ACI Virtual Nodes | Autopilot (fully serverless) |
| Networking | VPC CNI (native VPC IPs) | Azure CNI, kubenet | GKE Dataplane V2 (Cilium-based) |
| Service Mesh | VPC Lattice, App Mesh | Istio-based (managed), Open Service Mesh | Anthos Service Mesh (managed Istio) |
| Secrets Encryption | KMS Envelope Encryption | Azure Key Vault integration | Cloud KMS, Application-layer encryption |
| Identity Integration | IRSA, Pod Identity | Workload Identity (AAD) | Workload Identity (GCP IAM) |
| GPU Support | P4, P5, G5, Inf2 instances | NC, ND, NV series VMs | A100, H100, TPU v5e |
| Max Nodes per Cluster | 5,000 | 5,000 | 15,000 |
| Release Channel | Standard, Extended Support | Rapid, Regular, Stable | Rapid, Regular, Stable, Extended |
| Multi-Cluster Management | EKS Connector (basic) | Azure Arc, Fleet Manager | GKE Enterprise, Config Sync |
GKE leads in ease of use with Autopilot (fully serverless Kubernetes) and has the highest node-per-cluster limit. AKS offers a low cost entry point with its free control plane tier and strong Azure AD integration. EKS provides the deepest AWS service integration, and Karpenter is among the most capable node provisioners available on any managed platform. For organizations primarily on AWS, EKS with Karpenter is the clear choice.
Self-Healing Kubernetes Architecture
Self-healing Kubernetes is the ability of the cluster to automatically detect, respond to, and recover from failures at the node, pod, and infrastructure layers without human intervention. This is critical for production workloads where downtime has direct business impact.
Node-Level Self-Healing
Karpenter provides node-level self-healing by automatically replacing nodes that fail health checks or become NotReady. When a node goes NotReady, Kubernetes evicts its pods, those pods return to Pending, and Karpenter detects them and launches replacement nodes. The expireAfter setting (configured to 720 hours / 30 days in our module) ensures nodes are periodically recycled, preventing drift from configuration changes and avoiding long-running node issues.
For additional resilience, deploy the Node Problem Detector (NPD) DaemonSet, which monitors for hardware issues (disk failures, kernel panics, NTP sync failures) and reports them as node conditions. Combined with Karpenter's disruption settings, problematic nodes are automatically drained and replaced.
Pod-Level Self-Healing
Kubernetes built-in controllers (Deployments, StatefulSets, DaemonSets) automatically replace failed pods. For this to work correctly, every production pod must have properly configured liveness probes (restart unresponsive containers), readiness probes (remove unhealthy pods from service endpoints), and startup probes (allow slow-starting applications time to initialize). Pod Disruption Budgets (PDBs) ensure that voluntary disruptions (node drains, cluster upgrades) do not take down more than a specified number of pods simultaneously.
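The pieces described above fit together as follows; the application name, image, and endpoints are placeholders:

```yaml
# PDB: voluntary disruptions (drains, upgrades) must leave >= 2 pods running
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0   # placeholder image
          ports:
            - containerPort: 8080
          startupProbe:              # slow starters get up to ~5 minutes
            httpGet: { path: /healthz, port: 8080 }
            failureThreshold: 30
            periodSeconds: 10
          livenessProbe:             # restart unresponsive containers
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 10
          readinessProbe:            # gate traffic from Service endpoints
            httpGet: { path: /ready, port: 8080 }
            periodSeconds: 5
```

Note that the liveness probe only runs after the startup probe succeeds, so slow initialization does not trigger restart loops.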
Application-Level Self-Healing
Beyond Kubernetes-native self-healing, implement application-level recovery patterns: circuit breakers (Istio, Envoy) prevent cascading failures across services, retry policies with exponential backoff handle transient errors, health check endpoints expose application-specific health status beyond basic port checks, and automated rollback (using Argo Rollouts or Flagger) reverts deployments that fail health checks during progressive delivery.
Our terraform-aws-auto-healing-eks module extends the base EKS configuration with Node Problem Detector, custom health check controllers, and automated remediation through AWS Systems Manager runbooks.
Best Practices for Production EKS
These best practices represent lessons learned from operating dozens of production EKS clusters across enterprise environments:
- Use Karpenter for node autoscaling instead of Cluster Autoscaler. Karpenter provisions nodes in seconds, selects optimal instance types per workload, supports Spot with graceful fallback, and consolidates underutilized nodes automatically. Reserve managed node groups only for system-critical components (CoreDNS, Karpenter itself).
- Enable private API endpoint and disable public access. The Kubernetes API should not be reachable from the internet in production. Use VPN, Direct Connect, or Systems Manager Session Manager for administrative access. If CI/CD requires API access, use a VPC-peered runner or bastion host.
- Implement IRSA or Pod Identity for all AWS API access. Never use node-level IAM roles for application permissions. Each service account should have its own IAM role with least-privilege permissions. Use condition keys to restrict access to specific namespaces and service accounts.
- Deploy Pod Security Standards in enforce mode. Apply the Restricted profile to all application namespaces. No production pods should run as root, use host networking, or have elevated capabilities. Use OPA Gatekeeper or Kyverno for additional policy enforcement beyond PSS.
- Enable VPC CNI prefix delegation for high pod density. Default VPC CNI limits pods per node based on ENI capacity (roughly 29 pods on m5.large). Prefix delegation increases this to 110+, dramatically reducing node count and cost. Ensure your subnet CIDR supports the additional IP consumption.
- Configure default-deny network policies on all namespaces. Use Kubernetes native network policies (supported by VPC CNI v1.14+) to restrict pod-to-pod traffic. Explicitly allow only required communication paths. This prevents lateral movement in case of pod compromise.
- Use EKS managed addons for core components. VPC CNI, CoreDNS, kube-proxy, and EBS CSI driver should be deployed as EKS managed addons, not self-managed Helm releases. Managed addons are automatically tested for compatibility with your cluster version and can be upgraded in-place.
- Implement GitOps for application deployment. Use Argo CD or Flux to manage application deployments declaratively from Git. This provides audit trails, easy rollback, multi-cluster consistency, and eliminates manual kubectl operations. Store all manifests in version-controlled repositories.
- Plan for cluster upgrades from day one. EKS releases new Kubernetes versions approximately every four months, with each version supported for 14 months (or 26 months with extended support). Design your upgrade process, test with staging clusters, and schedule regular upgrades. Falling behind on versions accumulates technical debt and security risk.
- Monitor cluster health holistically. Deploy a comprehensive observability stack: Prometheus + Grafana for metrics (node, pod, application), Loki or CloudWatch for log aggregation, X-Ray or Jaeger for distributed tracing, and custom dashboards for Karpenter node provisioning, pod scheduling latency, and API server performance.
Frequently Asked Questions
What is the best way to deploy EKS with Terraform?
The recommended approach uses the terraform-aws-modules/eks/aws community module as the foundation, combined with custom configurations for Karpenter autoscaling, EKS managed addons, and security hardening. Structure your Terraform as separate modules for VPC, EKS cluster, and add-on components with clear dependency management using module outputs and data sources. Use remote state (S3 + DynamoDB) for team collaboration and state locking.
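A minimal remote state configuration under these assumptions looks like this; the bucket and table names are placeholders that must exist before `terraform init`:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"      # placeholder bucket name
    key            = "eks/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # placeholder lock table
    encrypt        = true
  }
}
```

The DynamoDB table needs a string partition key named `LockID`; locking prevents two engineers (or pipelines) from applying conflicting changes concurrently.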
Should I use Karpenter or Cluster Autoscaler for EKS?
Karpenter is recommended for most production EKS clusters. It provisions nodes directly through the EC2 Fleet API, bypasses ASG limitations, supports mixed instance types per workload, and responds to scheduling needs in seconds rather than the minutes typical of Cluster Autoscaler. Karpenter also handles node consolidation automatically, reducing cost by deprovisioning underutilized nodes. Cluster Autoscaler remains viable for simple, predictable workloads where ASG-based management is sufficient.
How do I secure an EKS cluster for production?
Production EKS security requires a layered approach: use private API endpoints only, implement IRSA for pod-level IAM permissions, enforce Pod Security Standards (Restricted profile) on application namespaces, deploy network policies with default-deny rules, enable envelope encryption for secrets with KMS, send all control plane logs (especially audit logs) to CloudWatch, deploy OPA Gatekeeper for custom policy enforcement, and conduct regular CIS Kubernetes benchmark scans.
What EKS addons should I install for production?
Essential production EKS addons include: vpc-cni (pod networking with prefix delegation), coredns (cluster DNS with high-availability replicas), kube-proxy (service networking), aws-ebs-csi-driver (persistent volume provisioning), aws-efs-csi-driver (shared storage for ReadWriteMany workloads), and adot (OpenTelemetry-based observability). Install all as EKS managed addons via Terraform for automatic version management, compatibility verification, and simplified upgrades.
How do I implement self-healing Kubernetes on EKS?
Self-healing EKS combines multiple layers: Karpenter automatically replaces failed or expired nodes and consolidates underutilized ones. Pod Disruption Budgets ensure safe voluntary disruptions during upgrades and scaling. Properly configured liveness, readiness, and startup probes enable Kubernetes to detect and restart unhealthy pods. Node Problem Detector identifies hardware-level issues and triggers node replacement. Automated rollback with Argo Rollouts or Flagger reverts failed deployments. Together, these create a cluster that recovers from failures at every level without operator intervention.
Need Enterprise-Grade Kubernetes on AWS?
Kehinde Ogunlowo and the team at Citadel Cloud Management design, deploy, and operate production EKS platforms for enterprises running mission-critical workloads on AWS.