Infrastructure as Code (IaC) has transformed how we provision and manage cloud resources. Yet writing Terraform configurations, CloudFormation templates, Kubernetes manifests, and Ansible playbooks remains a complex, error-prone task that demands deep expertise. AI assistants are changing this equation, enabling DevOps engineers to generate, optimize, and debug infrastructure code faster than ever before.
In this comprehensive guide, we'll explore how to effectively leverage AI for infrastructure management across the major IaC tools. You'll learn practical techniques for generating production-ready configurations, implementing security best practices, optimizing cloud costs, and debugging infrastructure issues. Teams adopting AI-assisted IaC commonly report substantially faster infrastructure development and fewer misconfigurations reaching production.
The AI-Assisted Infrastructure Landscape
Before diving into specific tools, let's understand where AI adds the most value in infrastructure management:
- Template Generation - Converting requirements to working IaC configurations
- Security Hardening - Identifying and fixing security misconfigurations
- Cost Optimization - Suggesting right-sized resources and cost-saving strategies
- Debugging - Interpreting cryptic error messages and plan failures
- Documentation - Generating module documentation and variable descriptions
- Migration - Converting between IaC formats (CloudFormation to Terraform, etc.)
The key principle is that AI excels at generating boilerplate and suggesting patterns, but human review remains essential for security-sensitive infrastructure decisions.
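Part of that human review can itself be automated as a cheap first pass. The sketch below flags obviously risky patterns in AI-generated HCL before a reviewer or a full scanner ever sees it; the pattern names and rules are illustrative, not a substitute for dedicated tools like tfsec or checkov:

```python
import re

# Hypothetical pre-review checks for AI-generated HCL.
# These regexes are illustrative; real scanners parse the HCL properly.
RISKY_PATTERNS = {
    "open_ingress": re.compile(r'cidr_blocks\s*=\s*\["0\.0\.0\.0/0"\]'),
    "hardcoded_secret": re.compile(r'(password|secret_key)\s*=\s*"[^"]+"', re.IGNORECASE),
    "wildcard_iam": re.compile(r'Action\s*=\s*"\*"'),
}

def flag_risky_lines(hcl: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) pairs that need human attention."""
    findings = []
    for lineno, line in enumerate(hcl.splitlines(), start=1):
        for name, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

snippet = '''
resource "aws_security_group" "db" {
  ingress {
    cidr_blocks = ["0.0.0.0/0"]
  }
}
resource "aws_db_instance" "main" {
  password = "hunter2"
}
'''
print(flag_risky_lines(snippet))  # [(4, 'open_ingress'), (8, 'hardcoded_secret')]
```

A hit is a prompt for review, not an automatic rejection: a public load balancer legitimately opens port 443 to the world, and only a human can confirm that intent.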
AI-Powered Terraform Development
Terraform's declarative HCL syntax is well-suited for AI generation, as patterns are highly predictable and well-documented. Let's explore practical AI workflows for Terraform.
Generating Terraform Configurations
Here's an effective prompt pattern for generating Terraform modules:
# Effective AI Prompt for Terraform Generation
"""
Create a Terraform module for an AWS ECS Fargate service with the following requirements:
Architecture:
- Application Load Balancer with HTTPS listener
- ECS Fargate service with auto-scaling (min: 2, max: 10)
- ECR repository for container images
- CloudWatch log group with 30-day retention
Security Requirements:
- Private subnets for ECS tasks
- Security groups with least-privilege rules
- IAM roles with minimal permissions
- Secrets Manager integration for environment variables
Input Variables:
- environment (dev/staging/prod)
- service_name
- container_port
- cpu and memory configurations
- domain_name for SSL certificate
Output:
- ALB DNS name
- ECS service ARN
- ECR repository URL
Follow AWS Well-Architected Framework best practices.
Use Terraform 1.5+ syntax with moved blocks for refactoring.
Include comprehensive variable validation.
"""
Here's what a well-structured AI response might generate:
# modules/ecs-fargate/main.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Data sources for existing resources
data "aws_vpc" "main" {
tags = {
Environment = var.environment
}
}
data "aws_subnets" "private" {
filter {
name = "vpc-id"
values = [data.aws_vpc.main.id]
}
tags = {
Tier = "private"
}
}
data "aws_subnets" "public" {
filter {
name = "vpc-id"
values = [data.aws_vpc.main.id]
}
tags = {
Tier = "public"
}
}
data "aws_acm_certificate" "main" {
domain = var.domain_name
statuses = ["ISSUED"]
most_recent = true
}
# ECR Repository
resource "aws_ecr_repository" "main" {
name = "${var.environment}-${var.service_name}"
image_tag_mutability = "IMMUTABLE"
image_scanning_configuration {
scan_on_push = true
}
encryption_configuration {
encryption_type = "KMS"
}
tags = local.common_tags
}
# ECR Lifecycle Policy
resource "aws_ecr_lifecycle_policy" "main" {
repository = aws_ecr_repository.main.name
policy = jsonencode({
rules = [
{
rulePriority = 1
description = "Keep last 10 images"
selection = {
tagStatus = "tagged"
tagPrefixList = ["v"]
countType = "imageCountMoreThan"
countNumber = 10
}
action = {
type = "expire"
}
},
{
rulePriority = 2
description = "Delete untagged images older than 7 days"
selection = {
tagStatus = "untagged"
countType = "sinceImagePushed"
countUnit = "days"
countNumber = 7
}
action = {
type = "expire"
}
}
]
})
}
# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "main" {
name = "/ecs/${var.environment}/${var.service_name}"
retention_in_days = var.log_retention_days
tags = local.common_tags
}
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "${var.environment}-${var.service_name}"
setting {
name = "containerInsights"
value = "enabled"
}
configuration {
execute_command_configuration {
logging = "OVERRIDE"
log_configuration {
cloud_watch_log_group_name = aws_cloudwatch_log_group.main.name
}
}
}
tags = local.common_tags
}
# ECS Task Execution Role
resource "aws_iam_role" "ecs_execution" {
name = "${var.environment}-${var.service_name}-execution"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
}
]
})
tags = local.common_tags
}
resource "aws_iam_role_policy_attachment" "ecs_execution" {
role = aws_iam_role.ecs_execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# Custom policy for Secrets Manager access
resource "aws_iam_role_policy" "secrets_access" {
name = "${var.environment}-${var.service_name}-secrets"
role = aws_iam_role.ecs_execution.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue"
]
Resource = [
"arn:aws:secretsmanager:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:secret:${var.environment}/${var.service_name}/*"
]
}
]
})
}
# ECS Task Role (for application permissions)
resource "aws_iam_role" "ecs_task" {
name = "${var.environment}-${var.service_name}-task"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
}
]
})
tags = local.common_tags
}
# Security Group for ALB
resource "aws_security_group" "alb" {
name = "${var.environment}-${var.service_name}-alb"
description = "Security group for ALB"
vpc_id = data.aws_vpc.main.id
ingress {
description = "HTTPS from anywhere"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "HTTP for redirect"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
description = "Allow all outbound"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = merge(local.common_tags, {
Name = "${var.environment}-${var.service_name}-alb"
})
}
# Security Group for ECS Tasks
resource "aws_security_group" "ecs_tasks" {
name = "${var.environment}-${var.service_name}-ecs"
description = "Security group for ECS tasks"
vpc_id = data.aws_vpc.main.id
ingress {
description = "Allow traffic from ALB"
from_port = var.container_port
to_port = var.container_port
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
egress {
description = "Allow all outbound"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = merge(local.common_tags, {
Name = "${var.environment}-${var.service_name}-ecs"
})
}
# Application Load Balancer
resource "aws_lb" "main" {
name = "${var.environment}-${var.service_name}"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = data.aws_subnets.public.ids
enable_deletion_protection = var.environment == "prod"
tags = local.common_tags
}
# ALB Target Group
resource "aws_lb_target_group" "main" {
name = "${var.environment}-${var.service_name}"
port = var.container_port
protocol = "HTTP"
vpc_id = data.aws_vpc.main.id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
interval = 30
matcher = "200"
path = var.health_check_path
port = "traffic-port"
protocol = "HTTP"
timeout = 5
unhealthy_threshold = 3
}
tags = local.common_tags
}
# HTTPS Listener
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.main.arn
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = data.aws_acm_certificate.main.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.main.arn
}
}
# HTTP to HTTPS Redirect
resource "aws_lb_listener" "http_redirect" {
load_balancer_arn = aws_lb.main.arn
port = "80"
protocol = "HTTP"
default_action {
type = "redirect"
redirect {
port = "443"
protocol = "HTTPS"
status_code = "HTTP_301"
}
}
}
# ECS Task Definition
resource "aws_ecs_task_definition" "main" {
family = "${var.environment}-${var.service_name}"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([
{
name = var.service_name
image = "${aws_ecr_repository.main.repository_url}:latest" # with IMMUTABLE tags, pin a versioned tag (e.g. v1.2.3) per deploy; :latest cannot be re-pushed
portMappings = [
{
containerPort = var.container_port
hostPort = var.container_port
protocol = "tcp"
}
]
environment = [
{
name = "ENVIRONMENT"
value = var.environment
},
{
name = "PORT"
value = tostring(var.container_port)
}
]
secrets = [
for secret_name in var.secret_names : {
name = secret_name
valueFrom = "arn:aws:secretsmanager:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:secret:${var.environment}/${var.service_name}/${secret_name}"
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.main.name
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = "ecs"
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}${var.health_check_path} || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}
])
tags = local.common_tags
}
# ECS Service
resource "aws_ecs_service" "main" {
name = var.service_name
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.main.arn
desired_count = var.min_capacity
launch_type = "FARGATE"
network_configuration {
subnets = data.aws_subnets.private.ids
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.main.arn
container_name = var.service_name
container_port = var.container_port
}
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
deployment_circuit_breaker {
enable = true
rollback = true
}
lifecycle {
ignore_changes = [desired_count]
}
tags = local.common_tags
}
# Auto Scaling Target
resource "aws_appautoscaling_target" "ecs" {
max_capacity = var.max_capacity
min_capacity = var.min_capacity
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
# CPU-based Auto Scaling
resource "aws_appautoscaling_policy" "cpu" {
name = "${var.environment}-${var.service_name}-cpu"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Memory-based Auto Scaling
resource "aws_appautoscaling_policy" "memory" {
name = "${var.environment}-${var.service_name}-memory"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageMemoryUtilization"
}
target_value = 80.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Data sources for account info
data "aws_region" "current" {}
data "aws_caller_identity" "current" {}
# Local values
locals {
common_tags = {
Environment = var.environment
Service = var.service_name
ManagedBy = "terraform"
}
}
# modules/ecs-fargate/variables.tf
variable "environment" {
description = "Environment name (dev, staging, prod)"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "service_name" {
description = "Name of the ECS service"
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{2,28}[a-z0-9]$", var.service_name))
error_message = "Service name must be 4-30 lowercase alphanumeric characters or hyphens."
}
}
variable "container_port" {
description = "Port the container listens on"
type = number
default = 8080
validation {
condition = var.container_port >= 1 && var.container_port <= 65535
error_message = "Container port must be between 1 and 65535."
}
}
variable "cpu" {
description = "CPU units for the task (256, 512, 1024, 2048, 4096)"
type = number
default = 256
validation {
condition = contains([256, 512, 1024, 2048, 4096], var.cpu)
error_message = "CPU must be 256, 512, 1024, 2048, or 4096."
}
}
variable "memory" {
description = "Memory in MB for the task"
type = number
default = 512
validation {
condition = var.memory >= 512 && var.memory <= 30720
error_message = "Memory must be between 512 and 30720 MB."
}
}
variable "min_capacity" {
description = "Minimum number of tasks"
type = number
default = 2
}
variable "max_capacity" {
description = "Maximum number of tasks"
type = number
default = 10
}
variable "domain_name" {
description = "Domain name for SSL certificate lookup"
type = string
}
variable "health_check_path" {
description = "Path for health check endpoint"
type = string
default = "/health"
}
variable "log_retention_days" {
description = "CloudWatch log retention in days"
type = number
default = 30
validation {
condition = contains([1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653], var.log_retention_days)
error_message = "Log retention must be a valid CloudWatch retention period."
}
}
variable "secret_names" {
description = "List of secret names to inject as environment variables"
type = list(string)
default = []
}
# modules/ecs-fargate/outputs.tf
output "alb_dns_name" {
description = "DNS name of the Application Load Balancer"
value = aws_lb.main.dns_name
}
output "alb_zone_id" {
description = "Zone ID of the ALB for Route53 alias records"
value = aws_lb.main.zone_id
}
output "ecs_service_arn" {
description = "ARN of the ECS service"
value = aws_ecs_service.main.id
}
output "ecs_cluster_name" {
description = "Name of the ECS cluster"
value = aws_ecs_cluster.main.name
}
output "ecr_repository_url" {
description = "URL of the ECR repository"
value = aws_ecr_repository.main.repository_url
}
output "task_execution_role_arn" {
description = "ARN of the task execution IAM role"
value = aws_iam_role.ecs_execution.arn
}
output "task_role_arn" {
description = "ARN of the task IAM role"
value = aws_iam_role.ecs_task.arn
}
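A root configuration consumes the module through the input variables defined above. The values here are illustrative:

```hcl
# environments/prod/main.tf (illustrative)
module "api" {
  source = "../../modules/ecs-fargate"

  environment    = "prod"
  service_name   = "orders-api"
  container_port = 8080
  cpu            = 512
  memory         = 1024
  min_capacity   = 2
  max_capacity   = 10
  domain_name    = "api.example.com"
  secret_names   = ["DATABASE_URL", "API_KEY"]
}

output "api_url" {
  value = module.api.alb_dns_name
}
```

Unset inputs such as health_check_path and log_retention_days fall back to the defaults declared in variables.tf.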
AI-Assisted Security Scanning
After generating Terraform code with AI, always run security scanning. Here's a workflow combining AI generation with automated validation:
#!/bin/bash
# scripts/terraform-ai-workflow.sh
set -uo pipefail # deliberately no -e: the scanners below exit non-zero when they report findings
# Step 1: Generate Terraform with AI (saved to generated.tf)
# Step 2: Format and validate
terraform fmt -recursive
terraform validate
# Step 3: Security scanning with tfsec
echo "Running tfsec security scan..."
tfsec . --format json > tfsec-results.json
# Step 4: Policy compliance with checkov
echo "Running checkov compliance scan..."
checkov -d . --output json > checkov-results.json
# Step 5: Cost estimation with infracost
echo "Estimating infrastructure costs..."
infracost breakdown --path . --format json > infracost-results.json
# Step 6: AI review of scan results
cat <<EOF | claude-cli
Review these infrastructure security scan results and provide:
1. Critical issues that must be fixed before deployment
2. Recommended fixes with code snippets
3. Risk assessment for each finding
tfsec results:
$(cat tfsec-results.json)
checkov results:
$(cat checkov-results.json)
infracost estimate:
$(cat infracost-results.json)
EOF
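Before handing raw scan output to an AI reviewer, it helps to gate the pipeline on severity so critical findings fail fast. A sketch of that triage step; the field names assume the tfsec and checkov JSON shapes current at the time of writing, so verify them against your installed versions:

```python
import json

def count_blocking(tfsec_json: str, checkov_json: str) -> int:
    """Count findings that should block deployment before any AI review."""
    tfsec = json.loads(tfsec_json)
    checkov = json.loads(checkov_json)
    blocking = 0
    # tfsec: top-level "results" list with a "severity" field per finding
    for finding in tfsec.get("results") or []:
        if finding.get("severity") in ("CRITICAL", "HIGH"):
            blocking += 1
    # checkov: failed checks live under results.failed_checks
    blocking += len(checkov.get("results", {}).get("failed_checks") or [])
    return blocking

tfsec_sample = '{"results": [{"rule_id": "aws-ecr-repository-customer-key", "severity": "LOW"}, {"rule_id": "aws-iam-no-policy-wildcards", "severity": "HIGH"}]}'
checkov_sample = '{"results": {"failed_checks": [{"check_id": "CKV_AWS_136"}]}}'
print(count_blocking(tfsec_sample, checkov_sample))  # 2
```

Wiring this between steps 4 and 5 of the script means only configurations that clear the severity bar spend money on cost estimation and AI review.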
AI-Powered CloudFormation Templates
AWS CloudFormation's YAML/JSON templates can be complex. AI helps generate well-structured templates with proper dependencies and error handling.
Generating CloudFormation Stacks
# AI-generated CloudFormation template for a serverless API
AWSTemplateFormatVersion: '2010-09-09'
Transform:
- AWS::LanguageExtensions # enables !If in DeletionPolicy/UpdateReplacePolicy below
- AWS::Serverless-2016-10-31
Description: >
Serverless REST API with Lambda, API Gateway, and DynamoDB
Parameters:
Environment:
Type: String
AllowedValues:
- dev
- staging
- prod
Default: dev
Description: Deployment environment
ServiceName:
Type: String
MinLength: 3
MaxLength: 30
AllowedPattern: ^[a-z][a-z0-9-]+$
Description: Name of the service
Globals:
Function:
Runtime: nodejs20.x
Timeout: 30
MemorySize: 256
Tracing: Active
Environment:
Variables:
ENVIRONMENT: !Ref Environment
TABLE_NAME: !Ref DynamoDBTable
LOG_LEVEL: !If [IsProd, 'warn', 'debug']
Conditions:
IsProd: !Equals [!Ref Environment, 'prod']
Resources:
# DynamoDB Table
DynamoDBTable:
Type: AWS::DynamoDB::Table
DeletionPolicy: !If [IsProd, Retain, Delete]
UpdateReplacePolicy: !If [IsProd, Retain, Delete]
Properties:
TableName: !Sub ${Environment}-${ServiceName}
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- AttributeName: PK
AttributeType: S
- AttributeName: SK
AttributeType: S
- AttributeName: GSI1PK
AttributeType: S
- AttributeName: GSI1SK
AttributeType: S
KeySchema:
- AttributeName: PK
KeyType: HASH
- AttributeName: SK
KeyType: RANGE
GlobalSecondaryIndexes:
- IndexName: GSI1
KeySchema:
- AttributeName: GSI1PK
KeyType: HASH
- AttributeName: GSI1SK
KeyType: RANGE
Projection:
ProjectionType: ALL
PointInTimeRecoverySpecification:
PointInTimeRecoveryEnabled: !If [IsProd, true, false]
SSESpecification:
SSEEnabled: true
Tags:
- Key: Environment
Value: !Ref Environment
- Key: Service
Value: !Ref ServiceName
# Lambda Execution Role
LambdaExecutionRole:
Type: AWS::IAM::Role
Properties:
RoleName: !Sub ${Environment}-${ServiceName}-lambda
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: lambda.amazonaws.com
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
- arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess
Policies:
- PolicyName: DynamoDBAccess
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- dynamodb:GetItem
- dynamodb:PutItem
- dynamodb:UpdateItem
- dynamodb:DeleteItem
- dynamodb:Query
- dynamodb:Scan
Resource:
- !GetAtt DynamoDBTable.Arn
- !Sub ${DynamoDBTable.Arn}/index/*
# API Lambda Function
ApiFunction:
Type: AWS::Serverless::Function
Properties:
FunctionName: !Sub ${Environment}-${ServiceName}-api
Handler: index.handler
CodeUri: ./src
Role: !GetAtt LambdaExecutionRole.Arn
Events:
ApiGatewayRoot:
Type: Api
Properties:
RestApiId: !Ref ApiGateway
Path: /{proxy+}
Method: ANY
# API Gateway
ApiGateway:
Type: AWS::Serverless::Api
Properties:
Name: !Sub ${Environment}-${ServiceName}
StageName: !Ref Environment
TracingEnabled: true
AccessLogSetting:
DestinationArn: !GetAtt ApiGatewayLogGroup.Arn
Format: '{"requestId":"$context.requestId","ip":"$context.identity.sourceIp","requestTime":"$context.requestTime","httpMethod":"$context.httpMethod","path":"$context.path","status":"$context.status","responseLatency":"$context.responseLatency"}'
MethodSettings:
- ResourcePath: /*
HttpMethod: '*'
ThrottlingBurstLimit: !If [IsProd, 5000, 1000]
ThrottlingRateLimit: !If [IsProd, 10000, 2000]
Cors:
AllowMethods: "'GET,POST,PUT,DELETE,OPTIONS'"
AllowHeaders: "'Content-Type,Authorization,X-Amz-Date,X-Api-Key'"
AllowOrigin: !If [IsProd, "'https://example.com'", "'*'"]
# CloudWatch Log Group for API Gateway
ApiGatewayLogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: !Sub /aws/apigateway/${Environment}-${ServiceName}
RetentionInDays: !If [IsProd, 90, 14]
# CloudWatch Alarms
LambdaErrorAlarm:
Type: AWS::CloudWatch::Alarm
Condition: IsProd
Properties:
AlarmName: !Sub ${Environment}-${ServiceName}-lambda-errors
AlarmDescription: Lambda function error rate alarm
MetricName: Errors
Namespace: AWS/Lambda
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 10
ComparisonOperator: GreaterThanThreshold
Dimensions:
- Name: FunctionName
Value: !Ref ApiFunction
TreatMissingData: notBreaching
ApiGateway5xxAlarm:
Type: AWS::CloudWatch::Alarm
Condition: IsProd
Properties:
AlarmName: !Sub ${Environment}-${ServiceName}-api-5xx
AlarmDescription: API Gateway 5XX error rate alarm
MetricName: 5XXError
Namespace: AWS/ApiGateway
Statistic: Sum
Period: 60
EvaluationPeriods: 5
Threshold: 10
ComparisonOperator: GreaterThanThreshold
Dimensions:
- Name: ApiName
Value: !Ref ApiGateway
- Name: Stage
Value: !Ref Environment
Outputs:
ApiEndpoint:
Description: API Gateway endpoint URL
Value: !Sub https://${ApiGateway}.execute-api.${AWS::Region}.amazonaws.com/${Environment}
Export:
Name: !Sub ${Environment}-${ServiceName}-ApiEndpoint
DynamoDBTableName:
Description: DynamoDB table name
Value: !Ref DynamoDBTable
Export:
Name: !Sub ${Environment}-${ServiceName}-TableName
LambdaFunctionArn:
Description: Lambda function ARN
Value: !GetAtt ApiFunction.Arn
Export:
Name: !Sub ${Environment}-${ServiceName}-LambdaArn
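The generic PK/SK plus GSI1 attributes in the table above are a standard single-table design. The entities below are hypothetical (the template only defines the attribute schema), but they show how key construction typically layers onto it:

```python
# Hypothetical entities layered onto the PK/SK + GSI1 schema from the
# CloudFormation template; only the attribute names come from the template.
def order_item(customer_id: str, order_id: str, placed_at: str) -> dict:
    return {
        "PK": f"CUSTOMER#{customer_id}",  # all orders for one customer share a partition
        "SK": f"ORDER#{order_id}",        # sort key distinguishes items in that partition
        "GSI1PK": f"ORDER#{order_id}",    # GSI1 supports direct lookup by order id
        "GSI1SK": f"PLACED#{placed_at}",  # and sorts by placement time
    }

item = order_item("c-42", "o-1001", "2024-06-01T12:00:00Z")
print(item["PK"])  # CUSTOMER#c-42
```

A Query with key condition PK = CUSTOMER#c-42 returns all of a customer's orders, while GSI1 answers "fetch order by id" without a table scan.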
AI-Powered Kubernetes Configurations
Kubernetes manifests require careful attention to resource limits, security contexts, and proper configuration. AI can generate production-ready YAML with best practices baked in.
Generating Production Kubernetes Manifests
# AI Prompt: Generate a production-ready Kubernetes deployment for a Node.js API
# with security hardening, resource limits, health checks, and HPA
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
namespace: production
labels:
app: api-service
version: v1
spec:
replicas: 3
revisionHistoryLimit: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: api-service
template:
metadata:
labels:
app: api-service
version: v1
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: api-service
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
seccompProfile:
type: RuntimeDefault
# Pod topology spread for high availability
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-service
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-service
containers:
- name: api
image: myregistry/api-service:v1.2.3
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
# Security hardening
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
# Resource management
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
# Health checks
livenessProbe:
httpGet:
path: /health/live
port: http
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: http
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
startupProbe:
httpGet:
path: /health/live
port: http
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
# Environment configuration
env:
- name: NODE_ENV
value: production
- name: PORT
value: "8080"
- name: LOG_LEVEL
value: info
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
# Secrets from external secrets manager
envFrom:
- secretRef:
name: api-service-secrets
# Volume mounts
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/.cache
volumes:
- name: tmp
emptyDir:
sizeLimit: 100Mi
- name: cache
emptyDir:
sizeLimit: 500Mi
# Graceful shutdown
terminationGracePeriodSeconds: 30
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
name: api-service
namespace: production
labels:
app: api-service
spec:
type: ClusterIP
ports:
- name: http
port: 80
targetPort: http
protocol: TCP
selector:
app: api-service
---
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-service
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
---
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-service
namespace: production
spec:
minAvailable: 2
selector:
matchLabels:
app: api-service
---
# networkpolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-service
namespace: production
spec:
podSelector:
matchLabels:
app: api-service
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: database
ports:
- protocol: TCP
port: 5432
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
---
# serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: api-service
namespace: production
annotations:
# For AWS IAM Roles for Service Accounts
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/api-service-role
automountServiceAccountToken: false
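The HPA in hpa.yaml scales with the standard Kubernetes target-tracking formula, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch of that core calculation:

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA calculation: scale proportionally to metric pressure, then clamp."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# With the manifest's CPU target of 70% and bounds 3..20:
print(desired_replicas(3, 140, 70, 3, 20))  # 6: double the load, double the pods
print(desired_replicas(10, 7, 70, 3, 20))   # 3: clamped to minReplicas
```

The real controller additionally applies a tolerance band around the target and the scaleUp/scaleDown behavior policies shown in the manifest, which is why the 300-second scale-down stabilization window prevents replica flapping.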
AI-Generated Ansible Playbooks
Ansible's YAML playbooks, declarative in intent but executed task by task, are another excellent target for AI generation. Here's how to create robust, idempotent playbooks with AI assistance.
Production Ansible Playbook Generation
# AI Prompt: Create an Ansible playbook to configure a production web server
# with Nginx, Node.js, security hardening, and monitoring
---
# site.yml - Main playbook
- name: Configure production web servers
hosts: webservers
become: true
vars_files:
- vars/main.yml
- vars/secrets.yml
pre_tasks:
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
when: ansible_os_family == "Debian"
- name: Gather facts about services
service_facts:
roles:
- role: security_hardening
tags: security
- role: nginx
tags: nginx
- role: nodejs
tags: nodejs
- role: monitoring
tags: monitoring
post_tasks:
- name: Ensure all services are running and enabled
service:
name: "{{ item }}"
state: started
enabled: yes
loop:
- nginx
- node_exporter
tags: verify
---
# roles/security_hardening/tasks/main.yml
- name: Install security packages
apt:
name:
- fail2ban
- ufw
- unattended-upgrades
- logrotate
state: present
tags: packages
- name: Configure automatic security updates
template:
src: 20auto-upgrades.j2
dest: /etc/apt/apt.conf.d/20auto-upgrades
mode: '0644'
- name: Configure UFW defaults
ufw:
direction: "{{ item.direction }}"
policy: "{{ item.policy }}"
loop:
- { direction: incoming, policy: deny }
- { direction: outgoing, policy: allow }
- name: Allow SSH through UFW
ufw:
rule: allow
port: "{{ ssh_port }}"
proto: tcp
- name: Allow HTTP/HTTPS through UFW
ufw:
rule: allow
port: "{{ item }}"
proto: tcp
loop:
- 80
- 443
- name: Enable UFW
ufw:
state: enabled
logging: 'on'
- name: Configure SSH hardening
template:
src: sshd_config.j2
dest: /etc/ssh/sshd_config
mode: '0600'
validate: '/usr/sbin/sshd -t -f %s'
notify: Restart SSH
- name: Configure fail2ban for SSH
template:
src: jail.local.j2
dest: /etc/fail2ban/jail.local
mode: '0644'
notify: Restart fail2ban
- name: Set kernel security parameters
sysctl:
name: "{{ item.name }}"
value: "{{ item.value }}"
sysctl_file: /etc/sysctl.d/99-security.conf
reload: yes
loop:
- { name: 'net.ipv4.conf.all.rp_filter', value: '1' }
- { name: 'net.ipv4.conf.default.rp_filter', value: '1' }
- { name: 'net.ipv4.icmp_echo_ignore_broadcasts', value: '1' }
- { name: 'net.ipv4.conf.all.accept_source_route', value: '0' }
- { name: 'net.ipv4.conf.all.send_redirects', value: '0' }
- { name: 'net.ipv4.tcp_syncookies', value: '1' }
- { name: 'net.ipv4.tcp_max_syn_backlog', value: '2048' }
- { name: 'kernel.randomize_va_space', value: '2' }
---
# roles/nginx/tasks/main.yml
- name: Install Nginx
apt:
name: nginx
state: present
- name: Create Nginx directories
file:
path: "{{ item }}"
state: directory
owner: www-data
group: www-data
mode: '0755'
loop:
- /etc/nginx/sites-available
- /etc/nginx/sites-enabled
- /etc/nginx/ssl
- /var/log/nginx
- /var/cache/nginx
- name: Generate DH parameters
command: openssl dhparam -out /etc/nginx/ssl/dhparam.pem 2048
args:
creates: /etc/nginx/ssl/dhparam.pem
- name: Deploy Nginx main configuration
template:
src: nginx.conf.j2
dest: /etc/nginx/nginx.conf
mode: '0644'
validate: 'nginx -t -c %s'
notify: Reload Nginx
- name: Deploy site configuration
template:
src: site.conf.j2
dest: /etc/nginx/sites-available/{{ app_name }}.conf
mode: '0644'
notify: Reload Nginx
- name: Enable site
file:
src: /etc/nginx/sites-available/{{ app_name }}.conf
dest: /etc/nginx/sites-enabled/{{ app_name }}.conf
state: link
notify: Reload Nginx
- name: Remove default site
file:
path: /etc/nginx/sites-enabled/default
state: absent
notify: Reload Nginx
- name: Start and enable Nginx
service:
name: nginx
state: started
enabled: yes
---
# roles/nginx/templates/site.conf.j2

# Upstream for Node.js application
upstream nodejs_backend {
    least_conn;
    server 127.0.0.1:{{ nodejs_port }} weight=1 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

# Rate limiting zones
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;

# HTTP to HTTPS redirect
server {
    listen 80;
    listen [::]:80;
    server_name {{ domain_name }};

    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
    }

    location / {
        return 301 https://$server_name$request_uri;
    }
}

# HTTPS server
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name {{ domain_name }};

    # SSL configuration
    ssl_certificate /etc/nginx/ssl/{{ domain_name }}.crt;
    ssl_certificate_key /etc/nginx/ssl/{{ domain_name }}.key;
    ssl_session_timeout 1d;
    ssl_session_cache shared:SSL:50m;
    ssl_session_tickets off;

    # Modern SSL configuration
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    ssl_dhparam /etc/nginx/ssl/dhparam.pem;

    # HSTS
    add_header Strict-Transport-Security "max-age=63072000" always;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    # Rate limiting
    limit_req zone=api_limit burst=20 nodelay;
    limit_conn conn_limit 10;

    # Logging
    access_log /var/log/nginx/{{ app_name }}_access.log combined;
    error_log /var/log/nginx/{{ app_name }}_error.log warn;

    # Proxy to Node.js
    location / {
        proxy_pass http://nodejs_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
        proxy_read_timeout 90s;
        proxy_connect_timeout 90s;
        proxy_send_timeout 90s;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://nodejs_backend/health;
        access_log off;
    }

    # Static files
    location /static/ {
        alias /var/www/{{ app_name }}/static/;
        expires 1y;
        add_header Cache-Control "public, immutable";
    }
}
---
# roles/nodejs/tasks/main.yml
- name: Install Node.js repository
  shell: |
    curl -fsSL https://deb.nodesource.com/setup_{{ nodejs_version }}.x | bash -
  args:
    creates: /etc/apt/sources.list.d/nodesource.list

- name: Install Node.js
  apt:
    name: nodejs
    state: present

- name: Install PM2 globally
  npm:
    name: pm2
    global: yes
    state: present

- name: Create application user
  user:
    name: "{{ app_user }}"
    system: yes
    shell: /bin/false
    home: "/var/www/{{ app_name }}"
    create_home: yes

- name: Create application directories
  file:
    path: "{{ item }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'
  loop:
    - "/var/www/{{ app_name }}"
    - "/var/www/{{ app_name }}/releases"
    - "/var/www/{{ app_name }}/shared"
    - "/var/log/{{ app_name }}"

- name: Deploy PM2 ecosystem file
  template:
    src: ecosystem.config.js.j2
    dest: "/var/www/{{ app_name }}/ecosystem.config.js"
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0644'
  notify: Restart PM2 application

- name: Configure PM2 startup
  command: pm2 startup systemd -u {{ app_user }} --hp /var/www/{{ app_name }}
  args:
    creates: /etc/systemd/system/pm2-{{ app_user }}.service
---
# vars/main.yml
app_name: myapp
app_user: myapp
domain_name: api.example.com
nodejs_version: 20
nodejs_port: 3000
ssh_port: 22
# Security settings
allowed_ssh_users:
- deploy
- admin
fail2ban_maxretry: 5
fail2ban_bantime: 3600
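The `limit_req` directives in the nginx template above implement token-bucket rate limiting: a sustained 10 requests per second per client IP, with `burst=20 nodelay` absorbing short spikes immediately and rejecting anything beyond. A minimal Python sketch of those semantics (an illustration of the idea, not nginx's actual implementation):

```python
class RateLimiter:
    """Token-bucket sketch of nginx `limit_req zone=... rate=10r/s burst=20 nodelay`.

    Sustained throughput is `rate` req/s; up to `burst` extra requests are
    absorbed immediately (nodelay); anything beyond is rejected.
    """

    def __init__(self, rate=10, burst=20):
        self.rate = rate
        self.capacity = burst + 1  # one "normal" slot plus the burst allowance
        self.tokens = self.capacity
        self.last = 0.0

    def allow(self, now):
        # Refill tokens for elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=10, burst=20`, 21 back-to-back requests succeed, the 22nd is rejected, and capacity recovers at 10 requests per second of idle time.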
AI-Driven Cloud Cost Optimization
One of the most valuable applications of AI in infrastructure management is cost optimization. AI can analyze your configurations and suggest significant savings.
Cost Analysis Prompts
# Effective AI prompt for cost optimization
"""
Analyze this Terraform configuration for AWS cost optimization:

[Paste your Terraform code here]

Please provide:

1. IMMEDIATE SAVINGS (can implement now):
   - Instance right-sizing recommendations
   - Storage tier optimizations
   - Unused resource identification
   - Reserved Instance vs On-Demand analysis

2. ARCHITECTURAL CHANGES (require planning):
   - Spot Instance opportunities for stateless workloads
   - Auto-scaling policy improvements
   - Multi-AZ vs Single-AZ tradeoffs
   - Data transfer cost reduction strategies

3. COST MONITORING:
   - Recommended CloudWatch/billing alerts
   - Budget threshold suggestions
   - Cost allocation tag recommendations

4. ESTIMATED SAVINGS:
   - Monthly cost before optimization
   - Monthly cost after optimization
   - Implementation effort for each recommendation

Focus on production-safe recommendations that maintain reliability.
"""
Terraform Cost Optimization Module
# modules/cost-optimization/main.tf
# AI-generated cost optimization patterns

# Scheduled scaling for predictable workloads
resource "aws_autoscaling_schedule" "scale_down_night" {
  count = var.enable_scheduled_scaling ? 1 : 0

  scheduled_action_name  = "scale-down-night"
  min_size               = var.night_min_capacity
  max_size               = var.night_max_capacity
  desired_capacity       = var.night_desired_capacity
  recurrence             = "0 22 * * *" # 10 PM UTC
  autoscaling_group_name = var.asg_name
}

resource "aws_autoscaling_schedule" "scale_up_morning" {
  count = var.enable_scheduled_scaling ? 1 : 0

  scheduled_action_name  = "scale-up-morning"
  min_size               = var.day_min_capacity
  max_size               = var.day_max_capacity
  desired_capacity       = var.day_desired_capacity
  recurrence             = "0 6 * * 1-5" # 6 AM UTC weekdays
  autoscaling_group_name = var.asg_name
}
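Before enabling schedules like these, quantify the payoff. Scaling down at 22:00 UTC daily but back up at 06:00 UTC only on weekdays means capacity runs at the day level for 80 of the week's 168 hours (weekends stay at the night level). A quick estimate of the instance-hours saved, using a hypothetical helper that is not part of the module:

```python
def weekly_compute_hours(day_capacity, night_capacity):
    """Instance-hours per week under the schedule above: day capacity
    06:00-22:00 UTC Mon-Fri, night capacity on weeknights and all weekend."""
    day_hours = 16 * 5             # 80 h/week at day capacity
    night_hours = 168 - day_hours  # 88 h/week at night capacity
    return day_capacity * day_hours + night_capacity * night_hours


def savings_vs_always_on(day_capacity, night_capacity):
    """Fraction of instance-hours saved versus running day capacity 24/7."""
    baseline = day_capacity * 168
    scheduled = weekly_compute_hours(day_capacity, night_capacity)
    return round(1 - scheduled / baseline, 3)
```

Dropping from, say, 6 to 2 instances off-hours saves roughly 35% of instance-hours, and roughly that share of on-demand compute cost.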
# S3 Intelligent Tiering for cost optimization
resource "aws_s3_bucket_intelligent_tiering_configuration" "main" {
  bucket = var.s3_bucket_id
  name   = "cost-optimization"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

# Cost allocation tags
resource "aws_resourcegroups_group" "cost_tracking" {
  name = "${var.environment}-${var.service_name}-resources"

  resource_query {
    query = jsonencode({
      ResourceTypeFilters = ["AWS::AllSupported"]
      TagFilters = [
        {
          Key    = "Environment"
          Values = [var.environment]
        },
        {
          Key    = "Service"
          Values = [var.service_name]
        }
      ]
    })
  }

  tags = {
    CostCenter  = var.cost_center
    Environment = var.environment
    Service     = var.service_name
  }
}
# Budget alerts
resource "aws_budgets_budget" "service_budget" {
  name              = "${var.environment}-${var.service_name}-monthly"
  budget_type       = "COST"
  limit_amount      = var.monthly_budget_limit
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
    # The filter format is "user:<TagKey>$<TagValue>". Build it with format()
    # because "$${...}" in HCL is an escape that yields a literal "${...}"
    # rather than an interpolated value.
    values = [
      format("user:Service$%s", var.service_name)
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = var.budget_alert_emails
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = var.budget_alert_emails
  }
}
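Note that the 80% FORECASTED notification fires when projected end-of-month spend exceeds the threshold, not when actual spend does. A rough sketch of the projection logic (a simple linear run-rate, not the more sophisticated model AWS actually uses) shows how early in the month a sustained overspend would trigger it:

```python
def forecast_exceeds_threshold(spend_to_date, day_of_month, days_in_month,
                               budget_limit, threshold_pct=80):
    """Linear run-rate projection: would forecasted month-end spend exceed
    threshold_pct of the budget? Illustrative only; AWS's forecasting
    model is more sophisticated than a straight-line extrapolation."""
    daily_rate = spend_to_date / day_of_month
    forecast = daily_rate * days_in_month
    return forecast > budget_limit * threshold_pct / 100
```

For example, $300 spent by day 10 of a 30-day month projects to $900; against a $1,000 budget that already clears the 80% ($800) forecast threshold, so the alert fires three weeks before the overspend materializes.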
AI-Assisted Infrastructure Testing
Testing infrastructure code is often neglected but critical. AI can help generate comprehensive test suites.
// AI-generated Terratest for infrastructure validation
package test

import (
	"fmt"
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/aws"
	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEcsFargateModule(t *testing.T) {
	t.Parallel()

	awsRegion := "us-east-1"
	uniqueID := random.UniqueId()

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/ecs-fargate",
		Vars: map[string]interface{}{
			"environment":    "test",
			"service_name":   fmt.Sprintf("test-%s", uniqueID),
			"container_port": 8080,
			"cpu":            256,
			"memory":         512,
			"min_capacity":   1,
			"max_capacity":   2,
			"domain_name":    "test.example.com",
		},
		EnvVars: map[string]string{
			"AWS_DEFAULT_REGION": awsRegion,
		},
	})

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Validate ECS cluster
	clusterName := terraform.Output(t, terraformOptions, "ecs_cluster_name")
	assert.NotEmpty(t, clusterName)

	cluster := aws.GetEcsCluster(t, awsRegion, clusterName)
	assert.Equal(t, "ACTIVE", *cluster.Status)
	assert.Equal(t, "enabled", *cluster.Settings[0].Value)

	// Validate ECR repository
	ecrUrl := terraform.Output(t, terraformOptions, "ecr_repository_url")
	assert.Contains(t, ecrUrl, "ecr")
	assert.Contains(t, ecrUrl, awsRegion)

	// Validate ALB
	albDns := terraform.Output(t, terraformOptions, "alb_dns_name")
	assert.NotEmpty(t, albDns)

	// Test ALB health (with retry)
	maxRetries := 10
	sleepBetweenRetries := 30 * time.Second
	http_helper.HttpGetWithRetry(
		t,
		fmt.Sprintf("https://%s/health", albDns),
		nil,
		200,
		"OK",
		maxRetries,
		sleepBetweenRetries,
	)

	// Validate security group rules
	sgId := terraform.Output(t, terraformOptions, "ecs_security_group_id")
	sg := aws.GetSecurityGroup(t, sgId, awsRegion)

	// Verify no wide-open ingress rules
	for _, rule := range sg.IpPermissions {
		for _, ipRange := range rule.IpRanges {
			assert.NotEqual(t, "0.0.0.0/0", *ipRange.CidrIp,
				"ECS security group should not have 0.0.0.0/0 ingress")
		}
	}
}

func TestCostOptimizationTags(t *testing.T) {
	t.Parallel()

	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/ecs-fargate",
		Vars: map[string]interface{}{
			"environment":  "test",
			"service_name": "cost-test",
			// ... other vars
		},
	})

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Validate cost allocation tags exist
	clusterArn := terraform.Output(t, terraformOptions, "ecs_service_arn")
	tags := aws.GetResourceTags(t, "us-east-1", clusterArn)

	requiredTags := []string{"Environment", "Service", "ManagedBy"}
	for _, tag := range requiredTags {
		_, exists := tags[tag]
		assert.True(t, exists, fmt.Sprintf("Required tag '%s' missing", tag))
	}
}
AI-Powered Infrastructure Debugging
When infrastructure deployments fail, AI can help interpret cryptic error messages and suggest fixes.
# AI Debugging Prompt Template
"""
I'm getting this Terraform error:
```
Error: creating ECS Service (arn:aws:ecs:us-east-1:123456789:service/my-cluster/my-service):
InvalidParameterException: Unable to assume the service linked role.
Please verify that the ECS service linked role exists.
```
My configuration:
[Paste relevant Terraform code]
Please help me:
1. Explain what this error means
2. Identify the root cause
3. Provide the exact fix with code
4. Explain how to prevent this in the future
"""
# AI Response would include:
# - The service-linked role needs to be created first
# - Code to create: aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
# - Or Terraform resource to add
# - Best practice: always include service-linked role creation in ECS modules
Best Practices for AI-Assisted IaC
Key Recommendations
- Never share credentials - Strip all secrets before sending code to AI
- Always validate - Run terraform validate, tfsec, and checkov on AI output
- Review security groups - AI may suggest overly permissive rules
- Test in non-prod first - Always deploy AI-generated IaC to dev/staging first
- Use specific versions - Pin provider and module versions in AI-generated code
- Implement state locking - AI may omit backend configuration
- Add comprehensive tags - Ensure cost allocation and ownership tags
- Review IAM policies - AI often suggests overly broad permissions
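The first recommendation, stripping secrets before code ever reaches an AI assistant, can be partially automated. The sketch below is a minimal pre-flight scan for a few well-known credential patterns (AWS access key IDs, private key headers, obvious secret assignments); it is illustrative only, not a substitute for a dedicated scanner such as gitleaks or trufflehog:

```python
import re

# A few well-known credential patterns; real scanners cover far more.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_secret_assignment": re.compile(
        r"(?i)(?:secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(text):
    """Names of patterns found; an empty list means 'probably safe to paste'."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

Run this over any snippet before pasting it into a prompt, and block the paste (or redact the match) whenever the result is non-empty.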
Conclusion
AI assistance transforms infrastructure as code development from a specialized skill into an accessible capability. By combining AI's ability to generate boilerplate quickly with human expertise for security review and architectural decisions, teams can achieve faster deployment cycles while maintaining the reliability production systems demand.
The key is treating AI as a knowledgeable assistant, not an autonomous agent. Always validate generated configurations with security scanners, test in non-production environments, and maintain human oversight for critical infrastructure decisions. With these guardrails in place, AI-assisted IaC delivers significant productivity gains while managing risk appropriately.
For more insights on integrating AI into your development workflows, explore our guides on Integrating AI into CI/CD Pipelines and Docker Containerization Anti-Patterns.