Chaos Engineering with Litmus

Master chaos engineering on Kubernetes with Litmus: experiments, ChaosHub, probes, faults, resilience scoring, and continuous validation practices.

1. Chaos Engineering Principles

Chaos Engineering is the discipline of experimenting on a system to build confidence that it can withstand unexpected conditions in production. The goal is not to "break" the system but to find its weaknesses before they find you.

Chaos Engineering Workflow

📊 Steady State: define normal behavior and capture a metrics baseline.
→ 💭 Hypothesis: the system should handle this fault gracefully.
→ 💥 Inject Fault: pod kill, network delay, CPU stress, disk fill.
→ 🔍 Analyze: verify the hypothesis, identify weaknesses, improve resilience.
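
The workflow above can be sketched as a small control loop. This is an illustration only, not Litmus code: `run_experiment`, `measure`, `inject`, and `rollback` are hypothetical names, and the "metric" is simply an error rate returned by a callable.

```python
from typing import Callable


def run_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    rollback: Callable[[], None],
    max_error_rate: float,
) -> bool:
    """Return True if the hypothesis held: the error rate stayed bounded."""
    # Steady state: refuse to inject if the system is already unhealthy.
    baseline = measure()
    if baseline > max_error_rate:
        raise RuntimeError(f"steady state not met (error rate {baseline})")

    # Hypothesis: the error rate stays at or below max_error_rate under fault.
    inject()
    try:
        # Observe while the fault is active.
        observed = measure()
    finally:
        # Always undo the fault, even if measuring raises.
        rollback()
    return observed <= max_error_rate


# Hypothetical stand-ins for a real metrics query and a real fault injector.
state = {"fault": False}
hypothesis_held = run_experiment(
    measure=lambda: 0.08 if state["fault"] else 0.01,
    inject=lambda: state.update(fault=True),
    rollback=lambda: state.update(fault=False),
    max_error_rate=0.05,
)
print("hypothesis held:", hypothesis_held)  # prints: hypothesis held: False
```

Here the simulated fault pushes the error rate to 8%, above the 5% bound, so the hypothesis is rejected and a weakness has been found. The `finally` block reflects the rule that a fault must be rolled back regardless of the outcome.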

1.1 The Four Principles of Chaos Engineering

2. LitmusChaos Overview

LitmusChaos is a CNCF chaos engineering framework for Kubernetes. It provides a fault injection library (ChaosHub), an orchestration engine, and observability for structured chaos experiments.

Component | Function | Detail
ChaosCenter | Web UI & management | Dashboard for managing experiments
ChaosEngine | CRD orchestration | Defines the experiments to run
ChaosExperiment | CRD fault definition | Defines a specific fault (pod delete, CPU stress)
ChaosResult | CRD result | Result of an experiment run
ChaosHub | Experiment library | Catalog of ready-to-use experiments
Chaos Operator | Controller | Reconciles ChaosEngine resources and triggers fault injection
ChaosRunner | Execution pod | Pod that runs the fault injection

3. Installing Litmus

Terminal: Install LitmusChaos
# Add the Helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Install the Litmus control plane (ChaosCenter)
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=NodePort

# Or install only the CRDs and operator (no UI)
helm install litmus-core litmuschaos/litmus-core \
  --namespace litmus --create-namespace

# Verify the installation
kubectl get pods -n litmus
# NAME                                    READY   STATUS
# litmus-server-xxx                       1/1     Running
# litmus-frontend-xxx                     1/1     Running
# litmus-mongo-xxx                        1/1     Running
# chaos-operator-ce-xxx                   1/1     Running

# Access the ChaosCenter UI
# (the service name and port vary by chart version and release name;
#  confirm them with: kubectl get svc -n litmus)
kubectl port-forward svc/litmusportal-frontend -n litmus 3030:80
# Open http://localhost:3030
# Default credentials: admin / litmus

# Install the litmusctl CLI
curl -sL https://litmusctl.litmuschaos.io/install.sh | bash
# Or download it from GitHub releases

# Log in via the CLI
litmusctl config set-account --endpoint="http://localhost:3030" \
  --username="admin" --password="litmus"
Litmus RBAC setup
# ServiceAccount and RBAC for chaos experiments
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: litmus

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events", "services",
      "configmaps", "secrets", "persistentvolumeclaims"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: litmus

4. Experiments

A ChaosExperiment defines the fault to inject. Litmus ships a library of experiments that can be installed from ChaosHub.

Pod Delete Experiment
# Install experiments from ChaosHub
# litmusctl get chaos-experiments --hub-name=ChaosHub

# Or apply a ChaosEngine manually
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: default
spec:
  # Target application
  appinfo:
    appns: default
    applabel: app=my-app
    appkind: deployment

  # Experiments to run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the chaos, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Interval between successive pod deletions, in seconds
            - name: CHAOS_INTERVAL
              value: '10'
            # Force delete (skip graceful termination)
            - name: FORCE
              value: 'false'
            # Comma-separated names of specific pods to target (empty = pick at random)
            - name: TARGET_PODS
              value: ''
            # Percentage of matching pods to target (empty or 0 targets a single pod)
            - name: PODS_AFFECTED_PERC
              value: ''
        # Probes that verify the hypothesis
        probe:
          - name: "Verify app is running"
            type: "httpProbe"
            mode: "Continuous"
            httpProbe/inputs:
              url: "http://my-app.default.svc:8080/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3

  # Engine state and the ServiceAccount the experiment runs as
  # (the ServiceAccount must exist in this ChaosEngine's namespace)
  engineState: "active"
  chaosServiceAccount: litmus-admin
💡 Start in Staging

Do not run chaos experiments directly in production. Start in a staging environment that resembles production, and make sure you have adequate monitoring (metrics, logs, alerts) before running experiments in production.

5. Chaos Hub

ChaosHub is a library of experiments that can be installed directly. Litmus offers more than 50 fault types.

Install experiments dari ChaosHub
# Install the generic experiments into the cluster
# (quote the URL so the shell does not treat "?" as a glob character)
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.0?file=charts/generic/experiments.yaml"

# Experiment categories available in ChaosHub:

# === Kubernetes Pod Faults ===
# pod-delete          : delete pods at random
# pod-cpu-hog         : stress CPU in pods
# pod-memory-hog      : exhaust memory in pods
# pod-network-loss    : simulate packet loss
# pod-network-latency : add network latency
# pod-dns-error       : simulate DNS failures

# === Kubernetes Node Faults ===
# node-drain           : drain pods from a node
# node-cpu-hog         : stress CPU on a node
# node-memory-hog      : exhaust memory on a node
# node-taint           : taint nodes
# kubelet-service-kill : kill the kubelet service

# === Cloud Provider Faults ===
# ec2-terminate        : terminate EC2 instances
# ebs-loss             : detach EBS volumes
# gcp-vm-instance-stop : stop GCP VM instances
# azure-instance-stop  : stop Azure VM instances

# === Network Faults ===
# pod-network-partition   : simulate a network partition
# pod-network-duplication : duplicate network packets
# network-jitter          : add network jitter
# bandwidth-limit         : limit network bandwidth

# Install the experiments into a specific namespace
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.0?file=charts/generic/experiments.yaml" -n litmus

# List the installed experiments
kubectl get chaosexperiments -n litmus
Network delay experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-delay-engine
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=api-gateway
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: '2000'  # 2000 ms (2 s) of added latency
            - name: JITTER
              value: '500'   # 500 ms jitter
            - name: DESTINATION_IPS
              value: ''      # All IPs
            - name: DESTINATION_HOSTS
              value: ''      # All hosts
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
        probe:
          - name: "Verify latency is bounded"
            type: "promProbe"
            mode: "Edge"
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              # p99 latency in seconds; the comparator below applies the threshold
              query: |
                histogram_quantile(0.99,
                  sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
              comparator:
                type: "float"
                criteria: "<"
                value: "5"
            runProperties:
              probeTimeout: 10
              interval: 5

6. Probes & Assertions

Probes are assertions that run before, during, and after an experiment. They verify that the system behaves as expected during fault injection.

Probe Type | Description | Example
httpProbe | Validates an HTTP endpoint | Health check returns 200
cmdProbe | Validates via command execution | kubectl get pods returns Running
k8sProbe | Validates Kubernetes resource state | Running pods with a given label are present
promProbe | Validates via a PromQL query | Error rate < 5%
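
Besides its type, each probe has a mode that decides when it runs relative to the fault. The lookup table below is a simplified sketch of how the Litmus modes are usually described (SOT, EOT, Edge, Continuous, OnChaos). `PHASES_BY_MODE` and `modes_running_in` are hypothetical names, and the three phase labels are an assumption made for illustration.

```python
# Phases of an experiment in which a probe of each mode is evaluated.
# Simplified: Continuous and OnChaos probes are polled repeatedly,
# which this table does not model.
PHASES_BY_MODE = {
    "SOT": ("pre-chaos",),                # start of test
    "EOT": ("post-chaos",),               # end of test
    "Edge": ("pre-chaos", "post-chaos"),  # both edges of the fault window
    "Continuous": ("pre-chaos", "during-chaos", "post-chaos"),
    "OnChaos": ("during-chaos",),         # strictly while the fault is active
}


def modes_running_in(phase: str) -> list[str]:
    """Return the probe modes that are evaluated in the given phase."""
    return sorted(m for m, phases in PHASES_BY_MODE.items() if phase in phases)


print(modes_running_in("during-chaos"))  # ['Continuous', 'OnChaos']
```

One practical consequence: an Edge probe never observes the system while the fault is active, so a check meant to hold under fault belongs in Continuous or OnChaos mode.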
Probe examples
# Probes inside a ChaosEngine experiment spec
experiments:
  - name: pod-delete
    spec:
      probe:
        # Probe 1: HTTP health check (continuous)
        - name: "App is responding"
          type: "httpProbe"
          mode: "Continuous"
          httpProbe/inputs:
            url: "http://my-app.default:8080/health"
            method:
              get:
                criteria: "=="
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 3
            probePollingInterval: 2

        # Probe 2: K8s resource validation (edge)
        - name: "Running app pods exist"
          type: "k8sProbe"
          mode: "Edge"
          k8sProbe/inputs:
            group: ""
            version: "v1"
            resource: "pods"
            namespace: "default"
            fieldSelector: "status.phase=Running"
            labelSelector: "app=my-app"
            operation: "present"
          runProperties:
            probeTimeout: 10

        # Probe 3: Prometheus metric validation (edge)
        - name: "Error rate below threshold"
          type: "promProbe"
          mode: "Edge"
          promProbe/inputs:
            endpoint: "http://prometheus.monitoring:9090"
            query: |
              sum(rate(http_requests_total{status=~"5.."}[2m]))
              / sum(rate(http_requests_total[2m])) * 100
            comparator:
              type: "float"
              criteria: "<"
              value: "5"
          runProperties:
            probeTimeout: 15
            interval: 10

        # Probe 4: Command validation (onChaos)
        - name: "DB connection exists"
          type: "cmdProbe"
          mode: "OnChaos"
          cmdProbe/inputs:
            command: "pg_isready -h postgres.default -p 5432"
            comparator:
              type: "string"
              criteria: "contains"
              value: "accepting"
          runProperties:
            probeTimeout: 10
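
The `comparator` blocks in these probes all follow the same pattern: take the probe's raw output, interpret it as `type`, and test it against `value` using `criteria`. A minimal sketch of that evaluation follows. It is not the Litmus implementation; `comparator_passes` is a hypothetical helper, and the supported criteria are a subset chosen for illustration.

```python
import operator

# Numeric criteria shared by the "int" and "float" comparator types.
NUMERIC_CRITERIA = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def comparator_passes(actual: str, comp_type: str, criteria: str, value: str) -> bool:
    """Evaluate a probe's raw output against a comparator definition."""
    if comp_type in ("int", "float"):
        cast = int if comp_type == "int" else float
        return NUMERIC_CRITERIA[criteria](cast(actual.strip()), cast(value))
    if comp_type == "string":
        if criteria == "contains":
            return value in actual
        if criteria == "equal":
            return actual == value
        if criteria == "notEqual":
            return actual != value
    raise ValueError(f"unsupported comparator: {comp_type} {criteria}")


# Ready replicas = 3, required >= 2
print(comparator_passes("3", "int", ">=", "2"))        # True
# Error rate = 7.5%, required < 5
print(comparator_passes("7.5", "float", "<", "5"))     # False
# pg_isready output must contain "accepting"
print(comparator_passes("postgres.default:5432 - accepting connections",
                        "string", "contains", "accepting"))  # True
```

Note that probe output always arrives as text, which is why the `type` field matters: comparing "10" and "9" as strings would give a different answer than comparing them as integers.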

7. ChaosEngine & Faults

Comprehensive ChaosEngine
# Full ChaosEngine with multiple experiments
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: comprehensive-resilience-test
  namespace: default
  labels:
    app.kubernetes.io/part-of: resilience-testing
    environment: staging
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: default
    applabel: app=my-app,env=staging
    appkind: deployment

  experiments:
    # Experiment 1: Pod Delete
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
        probe:
          - name: "App recovers after pod delete"
            type: "httpProbe"
            mode: "Edge"
            httpProbe/inputs:
              url: "http://my-app:8080/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 30
              interval: 5
              retry: 10

    # Experiment 2: CPU Stress
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: '2'
            - name: CPU_LOAD
              value: '100'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
        probe:
          - name: "Latency still acceptable under CPU pressure"
            type: "promProbe"
            mode: "Edge"
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              # p99 latency in seconds; the comparator below applies the threshold
              query: |
                histogram_quantile(0.99,
                  sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
              comparator:
                type: "float"
                criteria: "<"
                value: "2"

    # Experiment 3: Network Loss
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: '30'
            - name: DESTINATION_IPS
              value: ''
            - name: DESTINATION_HOSTS
              value: ''
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
        probe:
          - name: "App handles packet loss gracefully"
            type: "cmdProbe"
            mode: "Edge"
            cmdProbe/inputs:
              # The ready replica count must stay at or above 1
              command: "kubectl get deploy my-app -n default -o jsonpath='{.status.readyReplicas}'"
              comparator:
                type: "int"
                criteria: ">="
                value: "1"

  # Scheduling: ChaosEngine has no schedule field. Recurring runs
  # (for example every Saturday at 02:00) are defined with a separate
  # ChaosSchedule resource or a CI cron trigger.

  # Delete the experiment job and its pods after the run completes
  jobCleanUpPolicy: delete

8. Resilience Scoring

The Resilience Score is a metric that measures how well your application withstands failures. It is calculated from the pass/fail ratio of all probes run during the experiments.

Resilience score calculation
# Resilience Score formula
# RS = (number of probes passed / total probes) × 100

# Example: a ChaosEngine with 3 experiments
# Experiment 1 (pod-delete): 2/3 probes passed → 66.7%
# Experiment 2 (cpu-stress): 3/3 probes passed → 100%
# Experiment 3 (net-loss): 1/3 probes passed → 33.3%

# Overall Resilience Score = (66.7 + 100 + 33.3) / 3 = 66.7%

# Read the result from the ChaosResult CRD
# (named <engine-name>-<experiment-name>)
# kubectl get chaosresult -n default -o yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: comprehensive-resilience-test-pod-delete
  namespace: default
spec:
  engine: comprehensive-resilience-test
  experiment: pod-delete
status:
  experimentStatus:
    phase: Completed
    # A run passes only when every probe passes
    verdict: Fail
    probeSuccessPercentage: "66.7"
  # Counters across all runs of this experiment
  history:
    passedRuns: 0
    failedRuns: 2
    stoppedRuns: 0

# Monitoring the resilience score in Grafana
# PromQL queries for Litmus metrics (exact metric names depend on the chaos-exporter version)
# litmus_experiment_verdict{experiment="pod-delete"}
# litmus_probe_success_percentage

# Target resilience score per environment:
# Development: 60%+ (acceptable)
# Staging: 80%+ (good)
# Production: 90%+ (excellent)
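
The arithmetic in the formula above can be checked with a few lines of Python. This sketch assumes every experiment carries equal weight, which matches the worked example; if experiments were weighted differently, a weighted mean would replace the plain one. The function names are hypothetical.

```python
def probe_success_percentage(passed: int, total: int) -> float:
    """Share of probes that passed in one experiment, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total * 100


def overall_resilience_score(percentages: list[float]) -> float:
    """Unweighted mean of the per-experiment percentages."""
    if not percentages:
        raise ValueError("at least one experiment is required")
    return sum(percentages) / len(percentages)


per_experiment = {
    "pod-delete": probe_success_percentage(2, 3),
    "cpu-stress": probe_success_percentage(3, 3),
    "net-loss": probe_success_percentage(1, 3),
}
score = overall_resilience_score(list(per_experiment.values()))
print(f"Overall Resilience Score: {score:.1f}%")  # Overall Resilience Score: 66.7%
```

Against the targets listed above, 66.7% would be acceptable for development but below the 80% staging bar.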

8.1 CI/CD Integration

GitHub Actions — Chaos in CI/CD
# .github/workflows/chaos-test.yml
name: Chaos Engineering Test

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 6'  # Every Saturday at 02:00 UTC

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Kubernetes cluster
        uses: helm/kind-action@v1

      - name: Install Litmus
        run: |
          helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
          # litmus-core installs the CRDs and operator without the ChaosCenter UI
          helm install litmus-core litmuschaos/litmus-core \
            --namespace litmus --create-namespace

      - name: Deploy application
        run: kubectl apply -f k8s/

      - name: Run chaos experiments
        run: |
          kubectl apply -f chaos/pod-delete.yaml
          # Wait for the ChaosResult (<engine-name>-<experiment-name>) to appear
          timeout 120 bash -c \
            'until kubectl get chaosresult/pod-delete-pod-delete >/dev/null 2>&1; do sleep 5; done'
          # Wait for the experiment to complete
          kubectl wait \
            --for=jsonpath='{.status.experimentStatus.phase}'=Completed \
            chaosresult/pod-delete-pod-delete \
            --timeout=300s

      - name: Check resilience score
        run: |
          SCORE=$(kubectl get chaosresult pod-delete-pod-delete \
            -o jsonpath='{.status.experimentStatus.probeSuccessPercentage}')
          echo "Resilience Score: $SCORE%"
          if (( $(echo "$SCORE < 80" | bc -l) )); then
            echo "❌ Resilience score below threshold!"
            exit 1
          fi
          echo "✅ Resilience check passed"
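
A pipeline that runs several experiments may want to gate on every ChaosResult rather than a single one. The sketch below parses JSON shaped like the output of `kubectl get chaosresult -o json` and lists the results that fall below a threshold. The helper name and the sample data are hypothetical; only the `status.experimentStatus.probeSuccessPercentage` path comes from the ChaosResult shown earlier.

```python
import json


def results_below_threshold(chaosresults_json: str, threshold: float) -> list[str]:
    """Return the names of ChaosResults whose probe success is below threshold."""
    failing = []
    for item in json.loads(chaosresults_json).get("items", []):
        status = item.get("status", {}).get("experimentStatus", {})
        # A missing or empty percentage is treated as 0 so it fails the gate.
        score = float(status.get("probeSuccessPercentage") or 0)
        if score < threshold:
            failing.append(item["metadata"]["name"])
    return failing


# Hypothetical sample shaped like `kubectl get chaosresult -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"name": "pod-delete-pod-delete"},
     "status": {"experimentStatus": {"probeSuccessPercentage": "100"}}},
    {"metadata": {"name": "pod-delete-pod-cpu-hog"},
     "status": {"experimentStatus": {"probeSuccessPercentage": "66.7"}}},
]})
failing = results_below_threshold(sample, threshold=80)
print(failing)  # ['pod-delete-pod-cpu-hog']
```

In a workflow step, a non-empty list would translate into a non-zero exit code so the job fails.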
⚠️ Safety Guidelines

Chaos Engineering is not about breaking systems. Always make sure that: (1) you have a rollback plan, (2) the blast radius is limited, (3) monitoring is active, (4) the team is aware, and (5) you start in staging. Never run a chaos experiment on a system you do not fully understand.

9. Quiz: Test Your Understanding!

After reading the tutorial above, answer the following 5 questions:

Question 1: What is the main goal of Chaos Engineering?

a) Breaking production systems for testing
b) Experimenting on a system to find weaknesses before they cause unexpected failures
c) Replacing unit testing
d) Monitoring application performance

Question 2: What do Probes do in LitmusChaos?

a) Inject faults into the system
b) Act as assertions that verify the system behaves as expected before, during, and after an experiment
c) Collect logs from pods
d) Manage Kubernetes RBAC

Question 3: What does the Resilience Score measure?

a) Application response time
b) The amount of CPU and memory used
c) The percentage of probes that passed during the experiment, a measure of the system's resilience to faults
d) The number of running pods

Question 4: What is ChaosHub for in Litmus?

a) A dashboard for monitoring the cluster
b) A catalog of experiments that can be installed and used directly
c) A place to store experiment results
d) A CI/CD pipeline for chaos testing

Question 5: Why is it important to run chaos experiments in staging before production?

a) Because Litmus does not support production
b) To validate the experiments, understand the blast radius, and make sure monitoring is ready before taking risks in production
c) Because staging is cheaper
d) Because Litmus cannot access production