Kubernetes Deployment

Enterprise-grade deployment on Kubernetes with high availability and auto-scaling

Platform: Kubernetes
Cost: $50-200/month
Time: 2 hours
Difficulty: Advanced

Deploy OpenClaw on Kubernetes

An enterprise-grade deployment with high availability, auto-scaling, and declarative infrastructure. This guide walks through the full Kubernetes resource stack: namespace, secrets, persistent storage, deployment, service, ingress, and horizontal pod autoscaling.

Estimated time: 2 hours | Cost: $50-200/month | Difficulty: Advanced


Prerequisites

Before you begin, make sure you have:

  • A Kubernetes cluster -- any of the following:
    • Managed: AWS EKS, Google GKE, or Azure AKS
    • Self-hosted: k3s, kubeadm, or Rancher
    • Local development: minikube or kind (for testing only)
  • kubectl installed and configured to communicate with your cluster (kubectl cluster-info should succeed)
  • Helm v3 installed (optional, for the Helm chart section)
  • An API key for your LLM provider (Anthropic, OpenAI, etc.)
  • An Ingress controller installed in the cluster (e.g., ingress-nginx) if you want external HTTP/HTTPS access
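
A quick way to confirm these prerequisites before you start (a sketch; adapt to your tooling):

# Verify client tools and cluster connectivity
kubectl version --client
kubectl cluster-info
helm version            # only needed for the Helm chart section
# List installed Ingress controllers (no output means none is installed)
kubectl get ingressclass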

Step 1: Namespace and Secrets

Create a dedicated namespace to isolate OpenClaw resources:

kubectl create namespace openclaw

Store your API keys as a Kubernetes Secret. Never put API keys directly in Deployment manifests or ConfigMaps:

kubectl create secret generic openclaw-secrets \
  --namespace openclaw \
  --from-literal=ANTHROPIC_API_KEY='your-anthropic-api-key-here' \
  --from-literal=OPENAI_API_KEY='your-openai-api-key-here'

Verify the secret was created:

kubectl get secrets -n openclaw

Tip: For production clusters, consider using an external secrets manager like AWS Secrets Manager, HashiCorp Vault, or the External Secrets Operator instead of plain Kubernetes Secrets. Kubernetes Secrets are base64-encoded (not encrypted) by default.
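
For example, with the External Secrets Operator installed, an ExternalSecret can populate the same openclaw-secrets Secret from an external store. A minimal sketch -- the ClusterSecretStore name and remote key path are placeholders, not values this guide defines:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: openclaw-secrets
  namespace: openclaw
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager           # assumption: a store you have already configured
  target:
    name: openclaw-secrets              # the Secret the Deployment reads via envFrom
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: openclaw/anthropic-api-key # assumption: path in your external secrets manager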


Step 2: Deployment Manifest

Create a file named openclaw-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  namespace: openclaw
  labels:
    app: openclaw
spec:
  replicas: 2
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      containers:
        - name: openclaw
          image: openclaw/openclaw:latest  # Pin to a specific tag in production
          ports:
            - containerPort: 3111
              name: http
          env:
            - name: OPENCLAW_PORT
              value: "3111"
            - name: OPENCLAW_HOST
              value: "0.0.0.0"
            - name: OPENCLAW_LOG_LEVEL
              value: "info"
          envFrom:
            - secretRef:
                name: openclaw-secrets
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          volumeMounts:
            - name: openclaw-data
              mountPath: /home/openclaw/.openclaw
      volumes:
        - name: openclaw-data
          persistentVolumeClaim:
            claimName: openclaw-pvc

Key decisions in this manifest:

Field | Value | Rationale
replicas | 2 | Baseline HA -- survives a single pod failure
resources.requests.memory | 256Mi | Minimum memory the scheduler reserves per pod
resources.limits.memory | 512Mi | Hard ceiling to prevent a runaway process from starving other workloads
resources.requests.cpu | 250m | One quarter of a CPU core guaranteed
resources.limits.cpu | 500m | Burst up to half a core
livenessProbe | /health, 20s interval | Restarts the pod if it becomes unresponsive
readinessProbe | /health, 10s interval | Removes the pod from Service endpoints until it is ready to accept traffic
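
To check the /health endpoint the probes depend on, you can port-forward to the Deployment and call it directly (a sketch; assumes the container serves /health on port 3111 as configured above):

kubectl port-forward -n openclaw deploy/openclaw 3111:3111
# In a second terminal:
curl -i http://localhost:3111/health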

Step 3: Persistent Storage

OpenClaw stores configuration, skill data, and local state on disk. A PersistentVolumeClaim ensures this data survives pod restarts.

Create a file named openclaw-pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-pvc
  namespace: openclaw
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: gp3   # Adjust for your cluster (gp2/gp3 on EKS, standard on GKE, managed-csi on AKS)

Note: ReadWriteOnce means the volume can be mounted by pods on a single node. If you need multi-node access (e.g., replicas on different nodes reading the same data), use ReadWriteMany with a storage class that supports it (EFS on AWS, Filestore on GCP), or redesign the application to use a shared database instead of local files.
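
For illustration, a ReadWriteMany variant might look like the sketch below; the efs-sc StorageClass name is an assumption that depends on which CSI driver your cluster runs:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-pvc
  namespace: openclaw
spec:
  accessModes:
    - ReadWriteMany              # multi-node access; needs a StorageClass that supports it
  resources:
    requests:
      storage: 5Gi
  storageClassName: efs-sc       # assumption: e.g. AWS EFS via the EFS CSI driver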


Step 4: Service and Ingress

ClusterIP Service

Expose the OpenClaw pods to other workloads inside the cluster.

Create a file named openclaw-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: openclaw
  namespace: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http


Ingress with TLS

Route external traffic to the Service. This example assumes you are using the ingress-nginx controller and cert-manager for automatic TLS certificates.

Create a file named openclaw-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openclaw
  namespace: openclaw
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "86400"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "86400"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    # WebSocket support
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - openclaw.yourdomain.com   # Replace with your domain
      secretName: openclaw-tls
  rules:
    - host: openclaw.yourdomain.com  # Replace with your domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: openclaw
                port:
                  number: 80
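
After applying the Ingress, you can confirm that it received an address and that cert-manager issued the certificate (assumes cert-manager is installed, as noted above; the Certificate object is typically named after the TLS secret):

kubectl get ingress openclaw -n openclaw
kubectl get certificate -n openclaw
kubectl describe certificate openclaw-tls -n openclaw   # if issuance is stuck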

No domain? Skip the Ingress and use kubectl port-forward svc/openclaw -n openclaw 3111:80 for local access during development.


Step 5: Deploy

Apply all the manifests in order:

# Apply the PVC first so the volume is ready before the Deployment references it
kubectl apply -f openclaw-pvc.yaml

# Apply the Deployment, Service, and Ingress
kubectl apply -f openclaw-deployment.yaml
kubectl apply -f openclaw-service.yaml
kubectl apply -f openclaw-ingress.yaml   # Skip if you have no Ingress controller

Verify everything is running:

# Check pod status
kubectl get pods -n openclaw

# Watch pods come up in real time
kubectl get pods -n openclaw -w

# Check the Service endpoints
kubectl get endpoints openclaw -n openclaw

# View logs from a specific pod
kubectl logs -n openclaw -l app=openclaw --tail=50

# Describe a pod if it is stuck in Pending or CrashLoopBackOff
kubectl describe pod -n openclaw -l app=openclaw

A healthy deployment looks like this:

NAME                        READY   STATUS    RESTARTS   AGE
openclaw-6d4f8b7c9f-abc12   1/1     Running   0          2m
openclaw-6d4f8b7c9f-def34   1/1     Running   0          2m
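
You can also wait on the rollout explicitly, which is handy in CI pipelines:

kubectl rollout status deployment/openclaw -n openclaw
kubectl get deployment openclaw -n openclaw   # READY should show 2/2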

Step 6: Scaling and High Availability

Horizontal Pod Autoscaler

Automatically scale the number of pods based on CPU utilization. This requires the Metrics Server to be installed in your cluster (GKE and AKS ship it by default; on EKS you typically install it yourself).

Create a file named openclaw-hpa.yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: openclaw
  namespace: openclaw
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: openclaw
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Pod Disruption Budget

Ensure at least one pod is always available during voluntary disruptions (node drains, cluster upgrades):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: openclaw
  namespace: openclaw
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: openclaw
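
Apply and verify (the filename below is only a suggestion):

kubectl apply -f openclaw-pdb.yaml   # assuming you saved the manifest above under this name
kubectl get pdb openclaw -n openclaw # ALLOWED DISRUPTIONS shows how many pods may be evicted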

Anti-affinity rules

Spread replicas across different nodes so a single node failure does not take down all pods. Add this to the Deployment's spec.template.spec:

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - openclaw
                topologyKey: kubernetes.io/hostname

We use preferredDuringScheduling (soft rule) instead of requiredDuringScheduling so the scheduler can still place pods if you have fewer nodes than replicas.
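
If your nodes span multiple availability zones, a topologySpreadConstraints block (also under spec.template.spec) is a complementary sketch that spreads pods across zones while staying a soft rule:

      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # soft rule, like preferredDuringScheduling above
          labelSelector:
            matchLabels:
              app: openclaw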


Step 7: Helm Chart (Optional)

For teams that deploy OpenClaw across multiple environments (dev, staging, production), a Helm chart parameterizes the manifests above.

A minimal values.yaml:

# values.yaml
replicaCount: 2

image:
  repository: openclaw/openclaw
  tag: "1.0.0"          # Pin a specific version
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  hostname: openclaw.yourdomain.com
  tls: true
  clusterIssuer: letsencrypt-prod

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

persistence:
  enabled: true
  size: 5Gi
  storageClass: gp3

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70

secrets:
  anthropicApiKey: ""    # Set via --set or a secrets file, never commit to Git
  openaiApiKey: ""
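
A thin per-environment override file keeps the differences between dev, staging, and production small. A sketch (values-production.yaml is an assumed name; later --values files take precedence):

# values-production.yaml -- only the fields that differ from values.yaml
replicaCount: 3
ingress:
  enabled: true
  hostname: openclaw.example.com
autoscaling:
  maxReplicas: 20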

Install with environment-specific overrides:

helm install openclaw ./charts/openclaw \
  --namespace openclaw \
  --create-namespace \
  --values values.yaml \
  --set secrets.anthropicApiKey='your-key-here'

Upgrade after a configuration or image change:

helm upgrade openclaw ./charts/openclaw \
  --namespace openclaw \
  --values values.yaml
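
Before touching a live environment, a dry run previews the rendered manifests without applying anything:

helm upgrade openclaw ./charts/openclaw \
  --namespace openclaw \
  --values values.yaml \
  --dry-run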

Monitoring

Prometheus ServiceMonitor

If you run the Prometheus Operator (kube-prometheus-stack), create a ServiceMonitor to automatically scrape OpenClaw metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw
  namespace: openclaw
  labels:
    release: kube-prometheus-stack  # Must match your Prometheus Operator's label selector
spec:
  selector:
    matchLabels:
      app: openclaw
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
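
Confirm the ServiceMonitor exists and carries the label your Prometheus instance selects on (this also assumes OpenClaw exposes a /metrics endpoint, as referenced below):

kubectl get servicemonitor openclaw -n openclaw --show-labels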

Grafana dashboard

Import or create a Grafana dashboard that visualizes:

  • Request rate and latency (from OpenClaw's /metrics endpoint)
  • Pod CPU and memory usage (from Kubernetes metrics)
  • Restart counts and probe failures
  • HPA scaling events

Log aggregation with Loki

If you run Grafana Loki (part of the Grafana stack), logs from OpenClaw pods are automatically collected by Promtail or the Grafana Agent. Query them in Grafana with:

{namespace="openclaw", app="openclaw"}

Filter for errors:

{namespace="openclaw", app="openclaw"} |= "error"

For clusters without Loki, use kubectl logs:

# Tail logs from all OpenClaw pods simultaneously
kubectl logs -n openclaw -l app=openclaw -f --tail=100

# Logs from a specific pod
kubectl logs -n openclaw openclaw-6d4f8b7c9f-abc12 --tail=200

Cost Considerations

Kubernetes is powerful but not cheap. Here is a realistic cost breakdown for managed clusters:

Component | EKS (AWS) | GKE (Google) | AKS (Azure)
Control plane | $73/month | $73/month (Standard) | Free (Free tier)
Worker nodes (2x t3.medium / e2-medium / B2s) | ~$60/month | ~$50/month | ~$60/month
Load balancer | ~$18/month | ~$18/month | ~$18/month
Persistent storage (5 GB) | ~$1/month | ~$1/month | ~$1/month
Total estimate | ~$152/month | ~$142/month | ~$79/month

Is Kubernetes overkill for you? If you are a single user or a small team running one OpenClaw instance, the answer is probably yes. The Ubuntu Server guide above gives you the same result at a fraction of the cost. Kubernetes makes sense when you need multi-tenant isolation, auto-scaling across many agents, zero-downtime deployments, or integration with an existing Kubernetes-based platform.

For lower-cost Kubernetes, consider:

  • k3s on a single VPS: A lightweight Kubernetes distribution that runs on a $6-12/month VPS. You lose HA but gain the Kubernetes API and ecosystem (see the install sketch after this list).
  • GKE Autopilot: Pay only for pod resources, no node management. Can be cheaper for bursty workloads.
  • AKS with the free control plane: Azure does not charge for the Kubernetes control plane on the Free tier, saving ~$73/month versus EKS or GKE Standard.
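
For the k3s option, the standard installer brings up a single-node cluster in a minute or two:

# Installs k3s as a systemd service (single node, no HA)
curl -sfL https://get.k3s.io | sh -
# k3s bundles kubectl; confirm the node is Ready
sudo k3s kubectl get nodes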

Troubleshooting

Pods stuck in Pending

kubectl describe pod -n openclaw -l app=openclaw

Common causes:

  • Insufficient resources: The cluster does not have a node with enough free CPU or memory. Scale up your node pool or reduce the resource requests.
  • PVC not bound: The PersistentVolumeClaim cannot find a matching PersistentVolume. Check kubectl get pvc -n openclaw and verify the storageClassName matches an available StorageClass (kubectl get sc).

Pods in CrashLoopBackOff

kubectl logs -n openclaw -l app=openclaw --previous

Common causes:

  • Missing secrets: The openclaw-secrets Secret does not exist or is missing expected keys. Verify with kubectl get secret openclaw-secrets -n openclaw -o yaml.
  • Invalid API key: OpenClaw starts but immediately fails authentication with the LLM provider.
  • Health check failure: If the /health endpoint is not implemented or returns an error, the liveness probe kills the pod repeatedly. Temporarily remove the probes to debug.
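
To rule out the health-check case, you can call the endpoint from inside the cluster while a pod is up (a sketch; assumes wget or curl exists in the image):

kubectl exec -n openclaw deploy/openclaw -- wget -qO- http://localhost:3111/health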

Ingress returns 404 or 503

# Verify the Ingress resource exists and has an address
kubectl get ingress -n openclaw

# Check that the Service has endpoints (backing pods)
kubectl get endpoints openclaw -n openclaw

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

Common causes:

  • The host in the Ingress does not match the domain you are requesting
  • The Service selector does not match the pod labels
  • The Ingress controller is not installed or is in a different namespace

HPA not scaling

kubectl get hpa -n openclaw

If the TARGETS column shows <unknown>/70%, the Metrics Server is not installed or not reporting metrics. Install it:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

On some clusters (especially local ones like minikube), you may need to add --kubelet-insecure-tls to the Metrics Server deployment args.
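
One way to add that flag without editing YAML by hand is a JSON patch (a sketch; check the container args layout for your Metrics Server version first):

kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'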