Deploy AI Automation on AWS + Kubernetes: Complete Production Guide

You've built your AI automation service. Now you need to deploy it so it runs reliably, scales automatically, and doesn't bankrupt you on cloud costs. Here's the production stack we use at BuildPilot Labs.

The Stack

AWS EKS — managed Kubernetes (no control plane headaches)
Docker — containerised Go/Python services
GitHub Actions — CI/CD pipeline
Prometheus + Grafana — monitoring and alerting
AWS SQS — job queue for async AI tasks
PostgreSQL (RDS) — persistent storage
Redis (ElastiCache) — caching and rate limiting

Step 1: Containerise Your Service

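# Multi-stage build for a Go AI agent
# Stage 1: compile a static binary
FROM golang:1.22-alpine AS builder
WORKDIR /app
# Copy module files first so the dependency layer is cached between builds
COPY go.* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /agent ./cmd/agent

# Stage 2: minimal runtime image
FROM alpine:3.19
# CA certificates are needed for HTTPS calls to LLM APIs
RUN apk add --no-cache ca-certificates
COPY --from=builder /agent /agent
# Run as the unprivileged "nobody" user instead of root
USER 65534:65534
EXPOSE 8080
CMD ["/agent"]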

Step 2: Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent   # must match spec.selector.matchLabels
    spec:
      containers:
      - name: agent
        image: your-ecr-repo/ai-agent:latest   # pin a version tag or digest in production
        resources:
          requests: { cpu: "250m", memory: "512Mi" }
          limits:   { cpu: "1000m", memory: "1Gi" }
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: anthropic-key
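
On the service side, the key injected above arrives as an ordinary environment variable. A minimal sketch of reading it at startup, written in Python (one of the two service languages in the stack) and assuming a hypothetical AgentConfig type and an optional PORT override that are not part of the original setup:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical settings object for the agent process."""
    anthropic_api_key: str
    port: int


def load_config(env=None) -> AgentConfig:
    """Read settings from the environment and fail fast if the secret is missing."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        # Crashing at startup surfaces a missing Secret as a failed rollout
        # instead of as errors on the first LLM call.
        raise RuntimeError("ANTHROPIC_API_KEY is not set; check the ai-secrets Secret")
    # PORT is an assumed override; 8080 matches the EXPOSE line in the Dockerfile.
    return AgentConfig(anthropic_api_key=key, port=int(env.get("PORT", "8080")))
```

Failing at startup means a misnamed Secret or key shows up as pods that never become ready, which a rolling update will refuse to promote.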

Step 3: Auto-scaling for AI Workloads

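AI tasks are bursty: you might have 10 requests one minute and 1,000 the next. Use a Horizontal Pod Autoscaler driven by SQS queue depth rather than CPU. Kubernetes does not provide external metrics out of the box, so the cluster needs an external metrics adapter that publishes the queue depth; sqs_queue_depth stands for whatever name that adapter exposes: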

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_depth
      target:
        type: AverageValue
        averageValue: "5"
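
To get a feel for what an AverageValue target of 5 means, it helps to work through the arithmetic. For this target type the autoscaler divides the total metric by the target and rounds up, then clamps to the min/max bounds. The sketch below is only an approximation: the function name is made up for illustration, and it ignores the HPA's tolerance band, stabilisation windows, and scaling policies.

```python
import math


def desired_replicas(queue_depth: int, target_per_pod: int = 5,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Approximate the replica count an AverageValue target converges on."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Total backlog divided by the per-pod target, rounded up.
    wanted = math.ceil(queue_depth / target_per_pod)
    # Clamp to the minReplicas / maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for depth in (0, 10, 50, 1000):
        print(depth, "->", desired_replicas(depth))
```

With these numbers, 1,000 queued messages would call for 200 pods, so the autoscaler sits at maxReplicas: 20 until the backlog drains. If that is too slow, raise the cap or the per-pod target.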

Step 4: Cost Optimisation

Spot instances for AI worker nodes (often around 70% cheaper than On-Demand; AWS can reclaim them with a two-minute warning, so workers must shut down gracefully)
Right-size pods — AI tasks are memory-heavy, not CPU-heavy
Cache LLM responses — identical inputs get cached results (saves API costs)
Queue batching — batch multiple small requests into one LLM call
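
The last two items can be sketched in a few lines of Python. This is an illustration under assumptions, not the production implementation: CachedLLM, cache_key, and batched are hypothetical names, call_llm stands in for whatever client function you use, and the in-memory dict would be replaced by Redis with a TTL in a multi-pod deployment. Caching only makes sense where reusing an earlier answer for the same input is acceptable.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def cache_key(model: str, prompt: str) -> str:
    """Stable key: identical (model, prompt) pairs hash to the same value."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedLLM:
    """Wraps any call_llm(model, prompt) -> str function with an in-memory cache."""

    def __init__(self, call_llm: Callable[[str, str], str]):
        self._call_llm = call_llm
        self._cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str) -> str:
        key = cache_key(model, prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay for one API call and remember the answer.
        self.misses += 1
        result = self._call_llm(model, prompt)
        self._cache[key] = result
        return result


def batched(items: Iterable[str], size: int) -> List[List[str]]:
    """Group small requests into chunks of at most `size`, one LLM call per chunk."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks: List[List[str]] = []
    chunk: List[str] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            chunks.append(chunk)
            chunk = []
    if chunk:
        chunks.append(chunk)  # final partial chunk
    return chunks
```

The hits and misses counters give a cache hit rate that is worth exporting as a metric, since it shows directly how much API spend the cache is saving.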

Step 5: Monitoring AI-Specific Metrics

Track: LLM API latency, token usage per request, error rate by model, queue depth, cost per task. Alert on: API error rate > 5%, p99 latency > 10 s, daily cost exceeding budget.
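
In practice these thresholds would live in Prometheus alerting rules, but the logic is simple enough to spell out. The Python sketch below is illustrative only: the function names and alert labels are made up, and it uses a nearest-rank p99 over raw samples rather than Prometheus histogram buckets.

```python
from typing import List, Sequence


def p99(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # 1-based nearest rank, ceil(0.99 * n), computed with integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def firing_alerts(latencies_s: Sequence[float], errors: int, requests: int,
                  daily_cost: float, daily_budget: float) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts: List[str] = []
    if requests > 0 and errors / requests > 0.05:   # API error rate > 5%
        alerts.append("api_error_rate")
    if latencies_s and p99(latencies_s) > 10.0:     # p99 latency > 10 s
        alerts.append("latency_p99")
    if daily_cost > daily_budget:                   # spend over budget
        alerts.append("daily_cost")
    return alerts
```

Note that p99 needs a reasonable number of samples to be meaningful; with only a handful of requests in the window it is just the slowest one.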

Need help deploying your AI automation? We specialise in production Kubernetes deployments.