You've built your AI automation service. Now you need to deploy it so it runs reliably, scales automatically, and doesn't bankrupt you on cloud costs. Here's the production stack we use at BuildPilot Labs.
The Stack
- AWS EKS — managed Kubernetes (no control plane headaches)
- Docker — containerised Go/Python services
- GitHub Actions — CI/CD pipeline
- Prometheus + Grafana — monitoring and alerting
- AWS SQS — job queue for async AI tasks
- PostgreSQL (RDS) — persistent storage
- Redis (ElastiCache) — caching and rate limiting
Step 1: Containerise Your Service
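# Multi-stage build for Go AI agent

# Stage 1: compile a static binary
FROM golang:1.22-alpine AS builder
WORKDIR /app
# Copy module files first so the dependency layer is cached between builds
COPY go.* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /agent ./cmd/agent

# Stage 2: minimal runtime image containing only the binary
FROM alpine:3.19
# CA certificates are needed for outbound HTTPS calls to the LLM API
RUN apk add --no-cache ca-certificates
COPY --from=builder /agent /agent
EXPOSE 8080
CMD ["/agent"]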
Step 2: Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: agent
          # Pin an immutable tag or digest in production; :latest makes rollbacks ambiguous
          image: your-ecr-repo/ai-agent:latest
          resources:
            requests: { cpu: "250m", memory: "512Mi" }
            limits: { cpu: "1000m", memory: "1Gi" }
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-secrets
                  key: anthropic-key
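On the service side, the Secret arrives as an ordinary environment variable. A minimal Python sketch of reading it at startup follows; `load_api_key` and `MissingSecretError` are hypothetical names chosen for illustration, not part of any SDK.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_api_key(env=os.environ, name="ANTHROPIC_API_KEY"):
    # Kubernetes injects the Secret value as a plain environment variable.
    key = env.get(name, "").strip()
    if not key:
        # Fail fast at startup so a misconfigured pod is visible immediately,
        # rather than failing later on the first LLM request.
        raise MissingSecretError(f"{name} is not set")
    return key
```

Calling this once during startup means a missing or empty Secret shows up as a crashing pod instead of as intermittent request errors.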
Step 3: Auto-scaling for AI Workloads
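AI tasks are bursty: you might have 10 requests one minute and 1,000 the next. Scale with a Horizontal Pod Autoscaler driven by SQS queue depth rather than CPU, since workers waiting on LLM API calls use little CPU even when the backlog is growing. External metrics are not built into Kubernetes, so the cluster needs a metrics adapter that exposes the queue depth to the HPA: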
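apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth  # must be served by an external metrics adapter
        target:
          type: AverageValue
          averageValue: "5"  # aim for about 5 queued messages per pod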
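To see what this target means in practice, here is a rough Python sketch of the replica arithmetic for an `AverageValue` target: queue depth divided by the per-pod target, rounded up, then clamped to the min and max. It is a simplification that ignores the HPA's tolerance band and stabilization windows, and `desired_replicas` is a hypothetical helper name.

```python
import math


def desired_replicas(queue_depth, target_per_pod=5, min_replicas=2, max_replicas=20):
    """Approximate the HPA result for an External metric with an AverageValue target."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Enough pods that each one handles at most `target_per_pod` queued messages.
    raw = math.ceil(queue_depth / target_per_pod)
    # The HPA never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, raw))
```

Under these assumptions, a backlog of 42 messages asks for 9 pods, and anything above 95 messages hits the 20-pod cap. If the queue regularly sits above that level, either `maxReplicas` or the per-pod target needs revisiting.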
Step 4: Cost Optimisation
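- Spot instances for AI worker nodes (often around 70% cheaper than on-demand; handle the two-minute interruption notice with a graceful shutdown that finishes or re-queues in-flight jobs)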
- Right-size pods — AI tasks are memory-heavy, not CPU-heavy
- Cache LLM responses — identical inputs get cached results (saves API costs)
- Queue batching — batch multiple small requests into one LLM call
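The caching and batching ideas in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BuildPilot implementation: a plain dict stands in for Redis, `call_model` is a hypothetical callable that wraps the real LLM client, and caching is only safe when the same input should always produce the same answer (for example, deterministic sampling settings).

```python
import hashlib
import json


def cache_key(model, prompt, params=None):
    """Build a deterministic key from everything that affects the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params or {}},
        sort_keys=True,
    )
    return "llm:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedLLM:
    """Wrap a model call with a lookup cache (a dict stands in for Redis here)."""

    def __init__(self, call_model, store=None):
        self.call_model = call_model  # hypothetical callable: (model, prompt) -> str
        self.store = store if store is not None else {}
        self.misses = 0  # number of times the real API had to be called

    def complete(self, model, prompt):
        key = cache_key(model, prompt)
        if key in self.store:
            return self.store[key]  # cache hit: no API cost
        self.misses += 1
        result = self.call_model(model, prompt)
        self.store[key] = result
        return result


def batches(items, size):
    """Split queued requests into groups of at most `size`, one LLM call per group."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

A production version would add an expiry on cached entries and include any sampling parameters in the key, so that a change in settings does not return stale results.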
Step 5: Monitoring AI-Specific Metrics
Track: LLM API latency, token usage per request, error rate by model, queue depth, cost per task. Alert on: API errors > 5%, latency p99 > 10s, daily cost exceeding budget.
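In a Prometheus setup these thresholds would normally live in alerting rules, but the logic is easy to check in plain Python first. The sketch below evaluates the three alert conditions over one window of data; the function names and alert labels are hypothetical, and the p99 uses the simple nearest-rank method rather than Prometheus's histogram estimate.

```python
def p99(latencies):
    """Nearest-rank 99th percentile; returns 0.0 for an empty window."""
    if not latencies:
        return 0.0
    ordered = sorted(latencies)
    # Integer form of ceil(0.99 * n), giving a 1-based rank without float error.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_alerts(total_calls, failed_calls, latencies_s, daily_cost, daily_budget):
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    error_rate = failed_calls / total_calls if total_calls else 0.0
    if error_rate > 0.05:  # API errors above 5%
        alerts.append("api_error_rate")
    if p99(latencies_s) > 10.0:  # p99 latency above 10 seconds
        alerts.append("latency_p99")
    if daily_cost > daily_budget:  # spend above the daily budget
        alerts.append("daily_cost")
    return alerts
```

Running the same checks against a day of recorded metrics is a cheap way to see how often each alert would have fired before wiring it to a pager.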
Need help deploying your AI automation? We specialise in production Kubernetes deployments.