Production-Ready AKS Clusters with Terraform — Security, Networking, and Auto-Scaling
Deploying Azure Kubernetes Service in production demands far more than a basic az aks create command. A production-ready AKS cluster requires careful consideration of networking architecture, identity management, security posture, autoscaling policies, and operational readiness. In my experience managing large-scale Kubernetes deployments across enterprise environments, the difference between a development cluster and a production cluster typically spans hundreds of configuration decisions.
This guide walks through building a production-grade AKS deployment using Terraform, covering every critical dimension: from choosing the right networking model to implementing workload identity, enabling Microsoft Defender for Containers, and architecting multi-pool autoscaling strategies. Every configuration shown here is drawn from real-world enterprise deployments and encapsulated in reusable Terraform modules.
Why Production-Grade AKS Matters
Azure Kubernetes Service simplifies the operational burden of running Kubernetes by managing the control plane, but the responsibility for a secure and resilient data plane remains with you. A production AKS cluster must address several critical areas that default configurations simply do not cover.
First, there is the question of network isolation. By default, AKS clusters are accessible from the public internet. Production clusters should use private endpoints, integrate with a hub-spoke network topology, and enforce network policies at the pod level. The terraform-azure-hub-spoke-network module provides the foundational networking layer that AKS clusters should be deployed into.
Second, identity and access management must follow zero-trust principles. Legacy approaches using service principal secrets are a security liability. Modern AKS deployments use workload identity with federated credentials, eliminating long-lived secrets entirely. This is a fundamental shift in how Kubernetes workloads authenticate to Azure services.
Third, runtime security requires continuous monitoring. Microsoft Defender for Containers provides threat detection, vulnerability scanning, and security posture management directly integrated with the AKS control plane. Without runtime protection, container escapes, cryptomining attacks, and lateral movement go undetected until the damage is done.
Finally, cost efficiency through intelligent autoscaling ensures you are not over-provisioning resources. A well-configured cluster autoscaler combined with horizontal pod autoscaling and spot node pools can reduce compute costs by 40-60% compared to statically provisioned clusters.
AKS Networking Modes: Kubenet vs Azure CNI vs CNI Overlay vs CNI Cilium
The networking model you choose for your AKS cluster has profound implications for scalability, security, and integration with the broader Azure ecosystem. Azure now offers four distinct networking modes, each with different trade-offs. Understanding these differences is essential for making the right architectural decision. Refer to the Azure CNI overlay documentation for the latest updates on networking options.
For most production deployments, Azure CNI Overlay offers the best balance of performance, scalability, and simplicity. It eliminates the IP exhaustion problems of traditional Azure CNI while maintaining full Azure networking integration. If your organization requires advanced observability, L7 network policies, or transparent encryption between pods, CNI with Cilium powered by eBPF is the superior choice, offering kernel-level packet processing that dramatically reduces latency and CPU overhead.
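In Terraform terms, the choice comes down to a few attributes on the cluster's network profile. The fragment below is an illustrative sketch using azurerm provider 4.x attribute names (in 3.x the Cilium data-plane attribute was called ebpf_data_plane); the pod CIDR is a placeholder:

```hcl
# Azure CNI Overlay: pods get IPs from a private overlay CIDR,
# not from the node subnet, avoiding VNet IP exhaustion.
network_profile {
  network_plugin      = "azure"
  network_plugin_mode = "overlay"
  pod_cidr            = "192.168.0.0/16" # placeholder overlay range

  # For eBPF-powered CNI with Cilium (azurerm 4.x attribute name;
  # 3.x used ebpf_data_plane), uncomment:
  # network_data_plane = "cilium"
  # network_policy     = "cilium"
}
```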
Terraform AKS Module with Workload Identity and Defender
The following Terraform configuration represents a production-grade AKS deployment. It enables workload identity, configures Microsoft Defender for Containers, sets up CNI Overlay networking, and provisions a multi-pool architecture with autoscaling. This configuration is part of the terraform-azure-aks module, which provides a complete, reusable AKS deployment framework.
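The full module is not reproduced here; the condensed sketch below illustrates the core cluster resource with the settings discussed in this guide. It assumes azurerm provider 4.x attribute names, and all names, VM sizes, and CIDRs are placeholders rather than the module's actual defaults:

```hcl
# Illustrative sketch of a production AKS cluster: workload identity,
# Defender for Containers, CNI Overlay, and an autoscaling system pool.
resource "azurerm_kubernetes_cluster" "this" {
  name                = "aks-prod"               # placeholder
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "aks-prod"

  # No public API server endpoint; no long-lived credentials.
  private_cluster_enabled   = true
  oidc_issuer_enabled       = true
  workload_identity_enabled = true

  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.aks.id]
  }

  # System pool: critical add-ons only, zone-spread, min 3 nodes.
  default_node_pool {
    name                         = "system"
    vm_size                      = "Standard_D4s_v5" # placeholder
    auto_scaling_enabled         = true
    min_count                    = 3
    max_count                    = 5
    only_critical_addons_enabled = true
    zones                        = ["1", "2", "3"]
    vnet_subnet_id               = var.aks_subnet_id
  }

  network_profile {
    network_plugin      = "azure"
    network_plugin_mode = "overlay"
    network_policy      = "calico"
    pod_cidr            = "192.168.0.0/16" # placeholder
  }

  # Streams Defender security events to a Log Analytics workspace.
  microsoft_defender {
    log_analytics_workspace_id = var.log_analytics_workspace_id
  }

  # Key Vault CSI driver with automatic secret rotation.
  key_vault_secrets_provider {
    secret_rotation_enabled  = true
    secret_rotation_interval = "2m"
  }
}
```

User and spot node pools are attached as separate azurerm_kubernetes_cluster_node_pool resources, shown in the node pool strategy section below.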
This configuration represents hundreds of hours of production hardening distilled into a single module. Key architectural decisions include using only_critical_addons_enabled on the system node pool to prevent application workloads from competing with CoreDNS and kube-proxy for resources, ephemeral OS disks on user pools for improved I/O performance, and availability zones for fault tolerance across Azure data centers.
Configuring Workload Identity for Zero-Trust Access
Workload identity replaces the legacy pod identity and service principal approaches with a standards-based mechanism using OpenID Connect (OIDC) federation. Instead of mounting secrets into pods or injecting credentials through environment variables, workload identity allows Kubernetes service accounts to directly authenticate to Azure AD using short-lived tokens. This eliminates credential rotation concerns and reduces the attack surface of your application workloads.
The mechanism works by establishing a trust relationship between the AKS OIDC issuer and an Azure managed identity. When a pod requests a token, the Azure AD token exchange validates the Kubernetes service account token against the OIDC issuer URL, then issues an Azure AD token scoped to the managed identity's permissions. This entire flow happens without any secrets being stored in the cluster.
In the Terraform configuration above, the azurerm_federated_identity_credential resource establishes this trust. The subject field binds the credential to a specific Kubernetes service account in a specific namespace, ensuring that only the intended workload can assume the identity. This is a critical security boundary: even if another pod in the same cluster is compromised, it cannot impersonate the federated identity unless it runs with the exact service account specified.
For comprehensive virtual network integration, the terraform-azure-virtual-network module handles the subnet provisioning that AKS nodes are deployed into, including proper delegation and service endpoint configuration.
Microsoft Defender for Containers Integration
Enabling the microsoft_defender block in the Terraform configuration deploys a DaemonSet-based security sensor to every node in your AKS cluster. This sensor collects security events at the kernel level, including process execution, file system access, network connections, and system calls. These events are analyzed by Microsoft's threat intelligence engine to detect container-specific attacks such as cryptomining, reverse shells, privilege escalation attempts, and suspicious file downloads.
Defender for Containers also provides vulnerability assessment for container images stored in Azure Container Registry. Every image pushed to ACR is automatically scanned for known CVEs, and results are surfaced in the Microsoft Defender for Cloud portal (formerly Azure Security Center) with remediation recommendations. Combined with admission control policies, you can prevent vulnerable images from being deployed to production clusters.
The security posture management capabilities evaluate your cluster configuration against the CIS Kubernetes Benchmark and Azure security baselines. This includes checks for pod security standards, network policy enforcement, RBAC configuration, and secrets management. Non-compliant configurations are flagged with severity ratings and actionable remediation steps. For a thorough understanding of AKS security best practices, review the Azure AKS best practices documentation.
Node Pool Strategy and AKS Autoscaler Configuration
A well-designed node pool strategy is the foundation of a cost-effective and resilient AKS deployment. The Terraform module above implements a three-tier pool architecture that serves distinct purposes.
System Node Pool
The system pool runs with only_critical_addons_enabled = true, which applies a CriticalAddonsOnly=true:NoSchedule taint to prevent application pods from being scheduled. This guarantees that critical components like CoreDNS, metrics-server, and the Konnectivity tunnel agent have dedicated resources and are not affected by noisy neighbor workloads. The system pool uses a minimum of 3 nodes spread across availability zones for high availability.
Application Node Pool with Autoscaler
The application pool handles general workloads and scales between 2 and 20 nodes based on demand. The AKS cluster autoscaler monitors for unschedulable pods and provisions new nodes when existing capacity is exhausted. It also scales down when nodes are underutilized, respecting pod disruption budgets and graceful termination periods. Using ephemeral OS disks on this pool provides faster node startup times and improved I/O performance since the OS disk is backed by the VM's local SSD rather than remote Azure managed disk storage.
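As a sketch (azurerm 4.x attribute names; size and bounds are illustrative), the application pool looks like this:

```hcl
# General-purpose user pool: autoscaled, zone-spread, ephemeral OS disk.
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
  name                  = "apps"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = "Standard_D8s_v5" # placeholder
  auto_scaling_enabled  = true
  min_count             = 2
  max_count             = 20
  zones                 = ["1", "2", "3"]

  # Local-SSD-backed OS disk: faster node startup and I/O than a
  # remote managed disk; requires a VM size with enough cache/temp disk.
  os_disk_type = "Ephemeral"
}
```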
Spot Node Pool for Cost Optimization
The spot pool leverages Azure's excess capacity at up to 90% discount from regular pricing. The spot_max_price = -1 setting accepts any price up to the on-demand rate, maximizing availability. The eviction_policy = "Delete" ensures evicted nodes are removed and replaced rather than deallocated. A NoSchedule taint prevents non-spot-tolerant workloads from being scheduled, ensuring only fault-tolerant jobs with appropriate tolerations land on spot nodes.
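A corresponding sketch of the spot pool (illustrative names and bounds; AKS also applies the scalesetpriority taint automatically to spot pools):

```hcl
# Spot pool for fault-tolerant batch workloads.
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
  vm_size               = "Standard_D4s_v5" # placeholder
  priority              = "Spot"
  spot_max_price        = -1       # pay up to the on-demand rate
  eviction_policy       = "Delete" # replace, don't deallocate
  auto_scaling_enabled  = true
  min_count             = 0
  max_count             = 10

  # Only pods with a matching toleration land here.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
  node_labels = { "kubernetes.azure.com/scalesetpriority" = "spot" }
}
```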
Network Security with Hub-Spoke Topology
Production AKS clusters should be deployed into a hub-spoke network architecture where the hub contains centralized security services such as Azure Firewall, VPN gateways, and bastion hosts, while spoke virtual networks host individual workloads including AKS clusters. This pattern provides centralized traffic inspection, network segmentation, and simplified management of network security policies.
The terraform-azure-hub-spoke-network module provisions the complete hub-spoke infrastructure, including VNet peering, route tables with forced tunneling through Azure Firewall, and network security groups. When combined with the AKS module, all egress traffic from pods is routed through the hub firewall for inspection, ensuring compliance with enterprise security requirements.
Inside the cluster, Calico network policies enforce micro-segmentation at the pod level. Default-deny policies block all pod-to-pod communication unless explicitly allowed, following the principle of least privilege. This prevents lateral movement in the event of a container compromise. For clusters using the Cilium CNI, eBPF-based enforcement provides the same functionality with significantly lower overhead and the additional capability of L7 filtering on HTTP headers and DNS queries.
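A default-deny baseline can be managed from the same Terraform codebase via the kubernetes provider; the namespace below is illustrative:

```hcl
# Deny all ingress and egress for every pod in the namespace
# unless a more specific NetworkPolicy allows it.
resource "kubernetes_network_policy" "default_deny" {
  metadata {
    name      = "default-deny-all"
    namespace = "payments" # placeholder
  }

  spec {
    pod_selector {} # empty selector matches all pods
    policy_types = ["Ingress", "Egress"]
  }
}
```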
Best Practices for Production AKS Deployments
- Enable private cluster mode — Disable the public API server endpoint and access the cluster through a private endpoint within your VNet. Use Azure Private Link and a jump box or VPN for administrative access.
- Use managed identities exclusively — Eliminate all service principal secrets. Use user-assigned managed identities for the cluster and workload identity for application pods. The terraform-azure-key-vault module integrates with AKS CSI driver for any remaining secret management needs.
- Implement Azure RBAC for Kubernetes — Use Azure AD groups mapped to Kubernetes RBAC roles instead of managing Kubernetes-native role bindings. This centralizes access management and enables conditional access policies.
- Configure pod disruption budgets — Ensure every production deployment has a PDB that prevents voluntary disruptions from leaving fewer than the minimum required replicas running during node upgrades or scaling events.
- Enable Azure Policy for AKS — Apply built-in policy initiatives that enforce pod security standards, prevent privileged containers, require resource limits, and mandate image provenance from trusted registries.
- Set resource requests and limits on every pod — Without resource requests, the scheduler cannot make informed placement decisions, and the autoscaler cannot accurately determine when new nodes are needed. Without limits, a single pod can consume all node resources.
- Use availability zones — Spread nodes across all three availability zones in your region. Combined with pod anti-affinity rules, this ensures your application survives the loss of an entire data center.
- Implement GitOps with Flux v2 — Use AKS GitOps extensions to deploy applications declaratively from Git repositories. This provides audit trails, rollback capabilities, and drift detection.
- Configure maintenance windows — Schedule AKS upgrades and node image updates during off-peak hours. The maintenance window configuration in Terraform ensures predictable update behavior.
- Monitor with Container Insights and Prometheus — Enable Container Insights for Azure-native monitoring and deploy Azure Managed Prometheus for Kubernetes-native metrics. Use Grafana dashboards for unified observability across clusters.
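As an example of the maintenance-window practice above, the azurerm_kubernetes_cluster resource accepts a nested block for scheduling control-plane auto-upgrades; the values below are an illustrative off-peak window, not a recommendation:

```hcl
# Nested inside azurerm_kubernetes_cluster: constrain automatic
# Kubernetes upgrades to a weekly four-hour window.
maintenance_window_auto_upgrade {
  frequency   = "Weekly"
  interval    = 1
  duration    = 4 # hours
  day_of_week = "Sunday"
  start_time  = "02:00"
  utc_offset  = "+00:00"
}
```

A sibling maintenance_window_node_os block schedules node OS image updates the same way.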
Secrets Management with Azure Key Vault CSI Driver
Storing secrets in Kubernetes-native Secret objects is a well-known anti-pattern for production environments. Kubernetes Secrets are merely base64-encoded, and unless a KMS provider is configured they sit unencrypted at the application layer in etcd, making them vulnerable to cluster compromise. The Azure Key Vault Secrets Provider, enabled in the Terraform configuration through the key_vault_secrets_provider block, mounts secrets directly from Azure Key Vault into pods as volume mounts.
With secret_rotation_enabled = true and a 2-minute rotation interval, the CSI driver automatically fetches updated secret values from Key Vault without requiring pod restarts. Combined with workload identity, this creates a completely passwordless secrets pipeline: the pod authenticates to Key Vault using its federated identity, retrieves the secrets it is authorized to access, and mounts them as files in the pod filesystem.
The terraform-azure-key-vault module provisions the Key Vault instance with appropriate access policies, private endpoints, and diagnostic settings that integrate with the AKS deployment.
Frequently Asked Questions
What is the best networking model for production AKS clusters?
For production AKS clusters, Azure CNI Overlay is recommended for most workloads. It provides pod-level networking without consuming VNet IP addresses for each pod, supports up to 250 pods per node, and integrates natively with Azure networking features like Network Security Groups and Azure Firewall. For advanced use cases requiring L7 policies or transparent encryption, consider CNI with Cilium.
How do I enable workload identity on AKS with Terraform?
Enable workload identity by setting oidc_issuer_enabled and workload_identity_enabled to true in the azurerm_kubernetes_cluster resource. Then create a federated identity credential linking a Kubernetes service account to an Azure managed identity for secure, passwordless access to Azure services.
Should I use the AKS cluster autoscaler or KEDA for scaling?
Use both together. The AKS cluster autoscaler handles node-level scaling by adding or removing nodes based on pending pod scheduling. KEDA handles pod-level scaling based on event-driven metrics like queue depth or custom metrics. They complement each other for comprehensive autoscaling across both the infrastructure and application layers.
How does Microsoft Defender for Containers protect AKS?
Microsoft Defender for Containers provides runtime threat protection, vulnerability assessment for container images in ACR, security posture management, and real-time alerts for suspicious activities. It monitors cluster-level events, node activities, and container behavior using a DaemonSet-based sensor deployed across all cluster nodes.
What is the recommended node pool strategy for production AKS?
Use a dedicated system node pool with at least 3 nodes for critical system pods, separate user node pools for application workloads with autoscaling enabled, and optional spot node pools for fault-tolerant batch workloads. Apply taints and labels to control pod scheduling across pools, and spread nodes across availability zones for resilience.
Related Articles
Enterprise DevSecOps Pipeline Architecture for Multi-Cloud
Build security-first CI/CD pipelines spanning AWS, Azure, and GCP with shift-left scanning and compliance-as-code.
Building Autonomous AI Workflows with AWS Bedrock Agents
Orchestrate foundation models with knowledge bases and action groups for intelligent automation.
Need Help Building Production AKS Infrastructure?
Citadel Cloud Management specializes in designing and implementing production-grade Kubernetes platforms on Azure. From initial architecture through to operational excellence, we deliver secure, scalable infrastructure.
About the Author
Kehinde Ogunlowo
Principal Multi-Cloud DevSecOps Architect | Citadel Cloud Management
Kehinde architects enterprise cloud platforms across AWS, Azure, and GCP, specializing in Kubernetes orchestration, infrastructure as code, and security automation. With deep expertise in Terraform module design and multi-cloud networking, he helps organizations build resilient, compliant cloud infrastructure at scale.