1. Chaos Engineering Principles
Chaos Engineering is the discipline of running experiments on a production system to build confidence that the system can withstand unexpected conditions. The goal is not to "break" the system but to find weaknesses before they find you.
[Figure: chaos experiment cycle, with labels "Metrics baseline", "...this fault gracefully", "CPU stress, disk fill", "Identify weaknesses", "Improve resilience"]
1.1 The Four Principles of Chaos Engineering
- Build a Hypothesis around Steady State: define "normal" in terms of metrics (latency, error rate, throughput).
- Vary Real-world Events: simulate failures that actually happen (pod crashes, network partitions, resource exhaustion).
- Run Experiments in Production: run in production (or in a staging environment that closely resembles production) so the results are valid.
- Automate Experiments to Run Continuously: integrate experiments into the CI/CD pipeline for continuous resilience validation.
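The first principle becomes concrete once "normal" is written down as thresholds. The sketch below is one possible way to express a steady-state check in shell; the three threshold values and the function name `steady_state_ok` are hypothetical and should be replaced with numbers taken from your own metrics baseline.

```shell
# Hypothetical steady-state thresholds; derive real values from your baseline.
MAX_P99_LATENCY_MS=500
MAX_ERROR_RATE_PCT=1.0
MIN_THROUGHPUT_RPS=100

# steady_state_ok P99_LATENCY_MS ERROR_RATE_PCT THROUGHPUT_RPS
# Returns 0 when all three metrics are within the thresholds, 1 otherwise.
steady_state_ok() {
  awk -v lat="$1" -v err="$2" -v rps="$3" \
      -v max_lat="$MAX_P99_LATENCY_MS" \
      -v max_err="$MAX_ERROR_RATE_PCT" \
      -v min_rps="$MIN_THROUGHPUT_RPS" \
      'BEGIN { exit !(lat <= max_lat && err <= max_err && rps >= min_rps) }'
}

# Example with made-up measurements: 320 ms p99, 0.4% errors, 150 req/s.
if steady_state_ok 320 0.4 150; then
  echo "steady state holds"
else
  echo "steady state violated"
fi
```

Running the same check before and after a fault is injected gives a simple pass/fail signal for the hypothesis.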
2. LitmusChaos Overview
LitmusChaos is a CNCF chaos engineering framework for Kubernetes. Litmus provides a fault injection library (ChaosHub), an orchestration engine, and observability for structured chaos experiments.
| Component | Function | Details |
|---|---|---|
| ChaosCenter | Web UI & management | Dashboard for managing experiments |
| ChaosEngine | Orchestration CRD | Defines which experiments to run |
| ChaosExperiment | Fault definition CRD | Defines a specific fault (pod delete, CPU stress) |
| ChaosResult | Result CRD | Outcome of an experiment run |
| ChaosHub | Experiment library | Catalog of ready-made experiments |
| Chaos Operator | Controller | Reconciles ChaosEngine resources, injects faults |
| ChaosRunner | Execution pod | Pod that runs the fault injection |
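Once Litmus is installed, the CRD-based components in the table can be inspected with plain `kubectl`. The helper names below are hypothetical; the sketch assumes `kubectl` is pointed at a cluster where the Litmus CRDs are registered.

```shell
# List the Litmus CRDs registered in the current cluster.
list_litmus_crds() {
  kubectl get crd -o name | grep 'litmuschaos\.io'
}

# List Litmus custom resources of one kind across all namespaces.
# Example: list_litmus_resources chaosengines
list_litmus_resources() {
  kubectl get "$1" --all-namespaces
}

# Usage (requires cluster access):
#   list_litmus_crds
#   list_litmus_resources chaosresults
```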
3. Installing Litmus
# Add the Helm repo
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update

# Chart and service names differ between Litmus releases; confirm them with
# `helm search repo litmuschaos` and `kubectl get svc -n litmus`.

# Install the Litmus control plane (ChaosCenter)
helm install litmus litmuschaos/litmus-control-plane \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=NodePort

# Or install only the CRDs + operator (no UI)
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace

# Verify the installation
kubectl get pods -n litmus
# NAME                     READY   STATUS
# litmus-server-xxx        1/1     Running
# litmus-frontend-xxx      1/1     Running
# litmus-mongo-xxx         1/1     Running
# chaos-operator-ce-xxx    1/1     Running

# Access the ChaosCenter UI
kubectl port-forward svc/litmusportal-frontend -n litmus 3030:80
# Open http://localhost:3030
# Default credentials: admin / litmus

# Install the litmusctl CLI
curl -sL https://litmusctl.litmuschaos.io/install.sh | bash
# Or download it from the GitHub releases page

# Log in via the CLI
litmusctl config set-account --endpoint="http://localhost:3030" \
  --username="admin" --password="litmus"
# ServiceAccount and RBAC for chaos experiments
# Note: a ChaosEngine resolves chaosServiceAccount in its own namespace, so
# create this ServiceAccount in the namespace where the engine will run.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: litmus-admin
  namespace: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: litmus-admin
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "events", "services",
                "configmaps", "secrets", "persistentvolumeclaims"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "daemonsets"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["*"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: litmus-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: litmus-admin
subjects:
  - kind: ServiceAccount
    name: litmus-admin
    namespace: litmus
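Before running any experiment it can help to confirm that the binding took effect. The sketch below uses `kubectl auth can-i` with impersonation to spot-check a few of the permissions granted above; the function name and the selection of checks are illustrative, and the caller needs permission to impersonate service accounts.

```shell
# check_chaos_rbac NAMESPACE SERVICE_ACCOUNT
# Prints one line per spot check in the form "<verb> <resource>: yes|no".
check_chaos_rbac() {
  ns="$1"
  sa="$2"
  subject="system:serviceaccount:${ns}:${sa}"
  for check in "delete pods" "create jobs.batch" "patch deployments.apps"; do
    verb="${check%% *}"       # text before the first space
    resource="${check#* }"    # text after the first space
    answer="$(kubectl auth can-i "$verb" "$resource" --as="$subject" -n "$ns")"
    printf '%s %s: %s\n' "$verb" "$resource" "$answer"
  done
}

# Usage (requires cluster access):
#   check_chaos_rbac litmus litmus-admin
```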
4. Experiments
A ChaosExperiment defines the fault to be injected. Litmus ships a library of experiments that can be installed from ChaosHub.
# Install experiments from ChaosHub
# litmusctl get chaos-experiments --hub-name=ChaosHub
# Or define the run manually
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: default
spec:
  # Target application
  appinfo:
    appns: default
    applabel: app=my-app
    appkind: deployment
  # Experiments to run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the chaos, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Interval between pod deletions, in seconds
            - name: CHAOS_INTERVAL
              value: '10'
            # Force delete (no graceful termination)
            - name: FORCE
              value: 'false'
            # Names of specific target pods (empty = select by label)
            - name: TARGET_PODS
              value: ''
            # Percentage of pods affected
            - name: PODS_AFFECTED_PERC
              value: ''
        # Probes for verification (the ChaosEngine field is `probe`)
        probe:
          - name: "Verify app is running"
            type: "httpProbe"
            mode: "Continuous"
            httpProbe/inputs:
              url: "http://my-app.default.svc:8080/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
  # Experiment execution policy
  engineState: "active"
  # ServiceAccount used to run the experiment
  chaosServiceAccount: litmus-admin
Do not run chaos experiments directly in production. Start in a staging environment that closely resembles production. Make sure you have adequate monitoring (metrics, logs, alerts) before running experiments in production.
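After a ChaosEngine manifest is applied, its progress can be followed from the command line. The helpers below are a minimal sketch: the function names are hypothetical, and the sketch assumes the engine reports `completed` in `.status.engineStatus` and that the result is named `<engine-name>-<experiment-name>`, which may differ between Litmus versions.

```shell
# engine_status NAMESPACE ENGINE_NAME
# Prints the status string reported by the ChaosEngine.
engine_status() {
  kubectl get chaosengine "$2" -n "$1" -o jsonpath='{.status.engineStatus}'
}

# wait_for_engine NAMESPACE ENGINE_NAME [TIMEOUT_SECONDS]
# Polls every 5 seconds; returns 0 once the engine completes, 1 on timeout.
wait_for_engine() {
  ns="$1"
  engine="$2"
  remaining="${3:-300}"
  while [ "$remaining" -gt 0 ]; do
    if [ "$(engine_status "$ns" "$engine")" = "completed" ]; then
      return 0
    fi
    sleep 5
    remaining=$((remaining - 5))
  done
  return 1
}

# experiment_verdict NAMESPACE RESULT_NAME
# Prints the verdict stored in the ChaosResult.
experiment_verdict() {
  kubectl get chaosresult "$2" -n "$1" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}

# Usage (requires cluster access):
#   kubectl apply -f pod-delete-engine.yaml
#   wait_for_engine default pod-delete-engine 300 &&
#     experiment_verdict default pod-delete-engine-pod-delete
```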
5. Chaos Hub
ChaosHub is a library of experiments that can be installed directly. Litmus offers more than 50 fault types.
# Install the generic experiments into the cluster
# (the URL is quoted so the shell does not treat "?" as a glob character)
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.0?file=charts/generic/experiments.yaml"

# Experiment categories available on ChaosHub:

# === Kubernetes Faults ===
# pod-delete: delete pods at random
# pod-cpu-hog: stress the CPU of pods
# pod-memory-hog: exhaust the memory of pods
# pod-network-loss: simulate packet loss
# pod-network-delay: add network latency
# pod-dns-error: simulate DNS failures

# === Node Faults ===
# node-drain: drain pods from a node
# node-cpu-hog: stress the CPU of a node
# node-memory-hog: exhaust the memory of a node
# node-taint: taint nodes
# kubelet-service-kill: kill the kubelet service

# === Cloud Provider Faults ===
# ec2-terminate: terminate EC2 instances
# ebs-loss: detach EBS volumes
# gcp-vm-stop: stop GCP VM instances
# azure-vm-stop: stop Azure VM instances

# === Network Faults ===
# network-partition: simulate a network partition
# network-duplication: duplicate network packets
# network-jitter: add network jitter
# bandwidth-limit: limit network bandwidth

# Install the experiments into a specific namespace
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.0?file=charts/generic/experiments.yaml" -n litmus

# Check which experiments are installed
kubectl get chaosexperiments -n litmus
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-delay-engine
  namespace: default
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: default
    applabel: app=api-gateway
    appkind: deployment
  experiments:
    - name: pod-network-delay
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: '2000' # 2-second delay (milliseconds)
            - name: JITTER
              value: '500' # 500 ms jitter
            - name: DESTINATION_IPS
              value: '' # All IPs
            - name: DESTINATION_HOSTS
              value: '' # All hosts
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
        probe:
          - name: "Verify latency is bounded"
            type: "promProbe"
            mode: "Edge"
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              # The threshold lives in the comparator, so the query returns the
              # raw p99 value; histogram_quantile needs the `le` label kept.
              query: |
                histogram_quantile(0.99,
                  sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
              comparator:
                type: "float"
                criteria: "<"
                value: "5"
            runProperties:
              probeTimeout: 10
              interval: 5
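A PromQL query used in a `promProbe` is easier to debug when it is first run by hand against the same endpoint. The sketch below sends a query to the Prometheus instant-query HTTP API with `curl`; the function name is hypothetical, and the reply is printed as raw JSON, so a JSON tool is still needed to extract the value.

```shell
# prom_instant_query ENDPOINT QUERY
# Sends QUERY to the Prometheus instant-query API and prints the raw JSON reply.
prom_instant_query() {
  # -G turns the URL-encoded data into a query string on a GET request.
  curl -sS -G --data-urlencode "query=$2" "${1%/}/api/v1/query"
}

# Usage (requires network access to Prometheus):
#   prom_instant_query "http://prometheus.monitoring:9090" \
#     'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[1m])))'
```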
6. Probes & Assertions
Probes are assertions that run before, during, and after an experiment. They verify that the system behaves as expected while the fault is being injected.
| Probe Type | Description | Example |
|---|---|---|
| httpProbe | Validates an HTTP endpoint | Health check returns 200 |
| cmdProbe | Validates via command execution | kubectl get pods returns Running |
| k8sProbe | Validates Kubernetes resource state | Pods matching a label selector are present |
| promProbe | Validates via a PromQL query | Error rate < 5% |
# Probes inside the ChaosEngine experiment spec
experiments:
  - name: pod-delete
    spec:
      probe:
        # Probe 1: HTTP health check (continuous)
        - name: "App is responding"
          type: "httpProbe"
          mode: "Continuous"
          httpProbe/inputs:
            url: "http://my-app.default:8080/health"
            method:
              get:
                criteria: "=="
                responseCode: "200"
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 3
            probePollingInterval: 2
        # Probe 2: K8s resource validation (edge)
        # k8sProbe checks whether resources matched by selectors are present or
        # absent; numeric checks such as a replica count need a cmdProbe.
        - name: "App pods are running"
          type: "k8sProbe"
          mode: "Edge"
          k8sProbe/inputs:
            group: ""
            version: "v1"
            resource: "pods"
            namespace: "default"
            labelSelector: "app=my-app"
            fieldSelector: "status.phase=Running"
            operation: "present"
          runProperties:
            probeTimeout: 10
        # Probe 3: Prometheus metric validation (edge)
        - name: "Error rate below threshold"
          type: "promProbe"
          mode: "Edge"
          promProbe/inputs:
            endpoint: "http://prometheus.monitoring:9090"
            query: |
              sum(rate(http_requests_total{status=~"5.."}[2m]))
                / sum(rate(http_requests_total[2m])) * 100
            comparator:
              type: "float"
              criteria: "<"
              value: "5"
          runProperties:
            probeTimeout: 15
            interval: 10
        # Probe 4: Command validation (onChaos)
        - name: "DB connection exists"
          type: "cmdProbe"
          mode: "OnChaos"
          cmdProbe/inputs:
            command: "pg_isready -h postgres.default -p 5432"
            comparator:
              type: "string"
              criteria: "contains"
              value: "accepting"
          runProperties:
            probeTimeout: 10
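The `comparator` blocks above all follow the same pattern: a type, a criteria, and an expected value checked against what the probe observed. The function below is a rough sketch of that evaluation for a subset of criteria. It is not Litmus's implementation, and the name `comparator_passes` is hypothetical.

```shell
# comparator_passes KIND CRITERIA ACTUAL EXPECTED
# Returns 0 when ACTUAL satisfies CRITERIA against EXPECTED, 1 when it does
# not, and 2 for a KIND or CRITERIA this sketch does not handle.
comparator_passes() {
  kind="$1"
  criteria="$2"
  actual="$3"
  expected="$4"
  case "$kind" in
    int|float)
      case "$criteria" in
        "<"|">"|"<="|">="|"=="|"!=") ;;
        *) return 2 ;;
      esac
      # Adding 0 forces a numeric comparison in awk.
      awk -v a="$actual" -v e="$expected" \
        "BEGIN { exit !((a + 0) $criteria (e + 0)) }"
      ;;
    string)
      case "$criteria" in
        contains)
          case "$actual" in
            *"$expected"*) return 0 ;;
            *) return 1 ;;
          esac
          ;;
        equal) [ "$actual" = "$expected" ] ;;
        *) return 2 ;;
      esac
      ;;
    *) return 2 ;;
  esac
}

# Example: the error-rate probe passes when the query returns 3.2 (threshold 5).
comparator_passes float "<" 3.2 5 && echo "probe passed"
```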
7. ChaosEngine & Faults
# Complete ChaosEngine with multiple experiments
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: comprehensive-resilience-test
  namespace: default
  labels:
    app.kubernetes.io/part-of: resilience-testing
    environment: staging
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: default
    applabel: app=my-app,env=staging
    appkind: deployment
  experiments:
    # Experiment 1: Pod Delete
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
        probe:
          - name: "App recovers after pod delete"
            type: "httpProbe"
            mode: "Edge"
            httpProbe/inputs:
              url: "http://my-app:8080/health"
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 30
              interval: 5
              retry: 10
    # Experiment 2: CPU Stress
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: '2'
            - name: CPU_LOAD
              value: '100'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
        probe:
          - name: "Latency still acceptable under CPU pressure"
            type: "promProbe"
            mode: "Edge"
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring:9090"
              # The comparator applies the threshold; keep `le` for the quantile.
              query: |
                histogram_quantile(0.99,
                  sum by (le) (rate(http_request_duration_seconds_bucket[1m])))
              comparator:
                type: "float"
                criteria: "<"
                value: "2"
    # Experiment 3: Network Loss
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: '30'
            - name: DESTINATION_IPS
              value: ''
            - name: DESTINATION_HOSTS
              value: ''
            - name: CONTAINER_RUNTIME
              value: 'containerd'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
        probe:
          # k8sProbe checks presence or absence of resources matched by selectors
          - name: "App handles packet loss gracefully"
            type: "k8sProbe"
            mode: "Edge"
            k8sProbe/inputs:
              group: ""
              version: "v1"
              resource: "pods"
              namespace: "default"
              labelSelector: "app=my-app"
              fieldSelector: "status.phase=Running"
              operation: "present"
  # Periodic runs are not configured on the ChaosEngine itself; Litmus uses a
  # separate ChaosSchedule resource for that. Intended cadence: weekly, on
  # Saturday at 02:00.
  # Remove the experiment job pods once a run has finished
  jobCleanUpPolicy: delete
8. Resilience Scoring
The Resilience Score is a metric that measures how well your application withstands failures. It is computed from the pass/fail ratio of all probes run during the experiments.
# Resilience Score formula
# RS = (number of probes passed / total probes) × 100
# Example: a ChaosEngine with 3 experiments
# Experiment 1 (pod-delete): 2/3 probes passed → 66.7%
# Experiment 2 (pod-cpu-hog): 3/3 probes passed → 100%
# Experiment 3 (pod-network-loss): 1/3 probes passed → 33.3%
# Overall Resilience Score = (66.7 + 100 + 33.3) / 3 = 66.7%

# Fetch the results from the ChaosResult CRD
# kubectl get chaosresult -n default -o yaml
# (illustrative, abridged output; exact fields vary by Litmus version)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  # ChaosResult names follow <engine-name>-<experiment-name>
  name: comprehensive-resilience-test-pod-delete
  namespace: default
spec:
  engine: comprehensive-resilience-test
  experiment: pod-delete
status:
  experimentStatus:
    phase: Completed
    verdict: Pass
    probeSuccessPercentage: "66.7"
  runningPhase: Completed
  # History of each run
  history:
    - verdict: Pass
      probeSuccessPercentage: "66.7"
      updatedAt: "2026-06-29T02:15:00Z"
    - verdict: Fail
      probeSuccessPercentage: "33.3"
      updatedAt: "2026-06-22T02:15:00Z"

# Monitoring the resilience score in Grafana
# PromQL queries for Litmus metrics (names depend on the exporter version)
# litmus_experiment_verdict{experiment="pod-delete"}
# litmus_probe_success_percentage

# Target resilience score per environment:
# Development: 60%+ (acceptable)
# Staging: 80%+ (good)
# Production: 90%+ (excellent)
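The averaging step and the per-environment targets above can be reproduced with a few lines of shell. This is a sketch of the unweighted formula given in this section only; both function names are hypothetical.

```shell
# resilience_score PCT [PCT...]
# Prints the unweighted mean of the given probe success percentages,
# rounded to one decimal place.
resilience_score() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# meets_target SCORE TARGET
# Returns 0 when SCORE is at least TARGET, 1 otherwise.
meets_target() {
  awk -v score="$1" -v target="$2" 'BEGIN { exit !(score >= target) }'
}

# Worked example from the text: three experiments at 66.7%, 100% and 33.3%.
score="$(resilience_score 66.7 100 33.3)"
echo "Resilience Score: ${score}%"   # prints "Resilience Score: 66.7%"

if meets_target "$score" 80; then
  echo "meets the staging target"
else
  echo "below the staging target"
fi
```

With these inputs the score clears the 60% development target but not the 80% staging target.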
8.1 CI/CD Integration
# .github/workflows/chaos-test.yml
name: Chaos Engineering Test
on:
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 6' # Every Saturday at 02:00 (UTC)
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Kubernetes cluster
        uses: helm/kind-action@v1
      - name: Install Litmus
        run: |
          helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
          helm install litmus litmuschaos/litmus \
            --namespace litmus --create-namespace
      - name: Deploy application
        run: kubectl apply -f k8s/
      - name: Run chaos experiments
        run: |
          kubectl apply -f chaos/pod-delete.yaml
          # Wait for the experiment to finish. ChaosResult reports a phase
          # rather than a condition, so wait on the JSONPath value. The name
          # follows <engine-name>-<experiment-name>; this assumes the engine
          # in chaos/pod-delete.yaml is named pod-delete.
          kubectl wait \
            --for=jsonpath='{.status.experimentStatus.phase}'=Completed \
            chaosresult/pod-delete-pod-delete \
            --timeout=300s
      - name: Check resilience score
        run: |
          SCORE=$(kubectl get chaosresult -o \
            jsonpath='{.items[0].status.experimentStatus.probeSuccessPercentage}')
          echo "Resilience Score: $SCORE%"
          if (( $(echo "$SCORE < 80" | bc -l) )); then
            echo "❌ Resilience score below threshold!"
            exit 1
          fi
          echo "✅ Resilience check passed"
Chaos Engineering is not about breaking systems. Always make sure that you: (1) have a rollback plan, (2) limit the blast radius, (3) have active monitoring, (4) keep the team informed, and (5) start in staging. Never run a chaos experiment on a system you do not fully understand.
9. Quiz: Test Your Understanding!
After reading the tutorial above, answer the following 5 questions: