Prometheus & Grafana Advanced

📋 Table of Contents

Introduction to Prometheus & Grafana
Monitoring Stack Architecture
Installing the Full Stack
Metric Types in Prometheus
Custom Metrics from Applications
Prometheus Exporters
PromQL Advanced
Alerting Rules & Alertmanager
Building a Grafana Dashboard
Grafana Advanced Features
Best Practices
Comprehension Quiz

1. Introduction to Prometheus & Grafana

Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) project and is widely used across the Kubernetes and cloud-native ecosystem.

Grafana is a visualization and analytics platform for building interactive dashboards from many data sources, including Prometheus. Together they form one of the most common observability stacks for cloud-native systems.

Why Does Monitoring Matter?

Aspect	Without Monitoring	With Monitoring
Problem Detection	🔴 Users report it first	🟢 Automatic alerts before users notice
Root Cause Analysis	🔴 Guesswork	🟢 Data-driven: you know when and where it happened
Capacity Planning	🔴 Over-provisioning or under-provisioning	🟢 Data-driven decisions
SLA/SLO	🔴 Cannot be measured	🟢 Precise measurement & tracking
Performance Tuning	🔴 Bottlenecks unknown	🟢 Profiling & targeted optimization

Pull vs Push Model

Prometheus uses a pull model: it actively fetches (scrapes) metrics from its targets. This differs from push-based systems such as Graphite or StatsD:

Diagram: Pull vs Push Monitoring

  PUSH MODEL                        PULL MODEL (Prometheus)
  ┌──────────┐                      ┌──────────────┐
  │ App A    │──push──┐             │  Prometheus   │
  ├──────────┤        │             │  Server       │
  │ App B    │──push──┼──►┌───────┐ │               │
  ├──────────┤        │   │Central│ │  Scrape       │
  │ App C    │──push──┘   │Server │◄┤  (pull)       │
  └──────────┘            └───────┘ │               │
                                    │               │
  Each app sends data               │  /metrics ◄───┤── App A
  to the central server             │  /metrics ◄───┤── App B
  → Can overload                    │  /metrics ◄───┤── App C
  → Data can be lost                └──────────────┘
                                    Server pulls data from the apps
                                    → Controlled rate
                                    → Target health is visible (up metric)
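
To make the pull model concrete, here is a minimal sketch of what happens to the text a target returns from /metrics. It is a simplified parser written for illustration only: it ignores timestamps and escaping rules, and the sample payload is made up. It is not the parser Prometheus itself uses.

```python
import re

# One sample per line: metric_name{label="value",...} value
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # simplified: unsupported lines are ignored
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical response body of a single scrape
SCRAPED = '''
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status_code="200"} 1027
http_requests_total{method="POST",status_code="500"} 3
app_active_connections 12
'''

samples = parse_exposition(SCRAPED)
for name, labels, value in samples:
    print(name, labels, value)
```

Each scrape produces one value per series; Prometheus attaches the scrape time and appends it to that series in its TSDB.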

2. Monitoring Stack Architecture

Diagram: Full Monitoring Stack Architecture

┌──────────────────────────────────────────────────────────────┐
│                  MONITORING STACK                             │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐   │
│  │                 DATA SOURCES                          │   │
│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐       │   │
│  │  │ App A  │ │ App B  │ │ Node   │ │ MySQL  │       │   │
│  │  │/metrics│ │/metrics│ │Exporter│ │Exporter│       │   │
│  │  └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘       │   │
│  └──────┼──────────┼──────────┼──────────┼──────────────┘   │
│         │          │          │          │                   │
│         ▼          ▼          ▼          ▼                   │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              PROMETHEUS SERVER                         │   │
│  │  ┌─────────────┐  ┌──────────────┐  ┌────────────┐  │   │
│  │  │  TSDB       │  │  Rules Engine│  │  HTTP API  │  │   │
│  │  │  (Storage)  │  │  (Alerting)  │  │  (Query)   │  │   │
│  │  └─────────────┘  └──────┬───────┘  └──────┬─────┘  │   │
│  └──────────────────────────┼──────────────────┼────────┘   │
│                             │                  │             │
│                             ▼                  ▼             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   │
│  │Alertmanager  │   │   Grafana    │   │   PromQL     │   │
│  │              │   │  (Dashboard) │   │   (Query)    │   │
│  │→ Slack       │   │              │   │              │   │
│  │→ Email       │   │  Panels:     │   │  rate()      │   │
│  │→ PagerDuty   │   │  - Graphs    │   │  histogram   │   │
│  │→ Webhook     │   │  - Tables    │   │  quantile    │   │
│  └──────────────┘   │  - Gauges    │   └──────────────┘   │
│                     └──────────────┘                       │
└──────────────────────────────────────────────────────────────┘

Main components of the monitoring stack:

Prometheus Server: collects and stores time-series data
Exporters: convert metrics from other systems into the Prometheus format
Alertmanager: manages and delivers alert notifications
Grafana: visualization and interactive dashboards
Pushgateway: for short-lived jobs that cannot be scraped

3. Installing the Full Stack

Docker Compose (Development)

YAML

# docker-compose-monitoring.yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

YAML

# prometheus/prometheus.yml
global:
  scrape_interval: 15s         # Default scrape interval
  evaluation_interval: 15s     # Rules evaluation interval
  scrape_timeout: 10s          # Scrape timeout

# Alert Rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager Configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape Configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          environment: 'production'

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 'cadvisor:8080'

  # Application metrics
  - job_name: 'my-app'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'app:8080'
        labels:
          app: 'myapp'
          environment: 'production'

  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets:
          - 'mysql-exporter:9104'

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis-exporter:9121'

  # NGINX Exporter
  - job_name: 'nginx'
    static_configs:
      - targets:
          - 'nginx-exporter:9113'

Kubernetes with Helm

Bash

# Add the repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + exporters)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Verify
kubectl get pods -n monitoring

# Access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Open http://localhost:3000 (admin/admin123)

# Access Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090

4. Metric Types in Prometheus

Prometheus supports four core metric types. Understanding the differences is essential for writing correct PromQL.

Type	Description	Examples	PromQL
Counter	A value that only increases (or resets to 0 on restart)	HTTP requests total, bytes sent	`rate()`
Gauge	A value that can go up and down	Temperature, memory usage, queue size	`avg()`, `max()`
Histogram	Counts observations in buckets (value distribution)	Request duration, response size	`histogram_quantile()`
Summary	Like a histogram, but quantiles are computed on the client	Request duration quantiles	Read the quantiles directly

PromQL

# Counter: always wrap in rate() or increase()
# HTTP requests per second, averaged over the last 5 minutes
rate(http_requests_total[5m])

# Total requests over the last hour
increase(http_requests_total[1h])

# Gauge: query the value directly
# Available memory as a percentage of total
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# 1-minute CPU load average
node_load1

# Histogram: use histogram_quantile()
# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
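
Because histogram buckets are cumulative, a quantile has to be estimated by interpolating inside the bucket that contains the requested rank. The sketch below illustrates that idea in plain Python. It is a simplified model of what histogram_quantile() does, not the actual implementation, and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count); must include +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for le="0.1", "0.5", "1.0", "+Inf"
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(p50, p95, p99)
```

The p99 result shows why bucket boundaries matter: once the rank lands in the +Inf bucket, the estimate is capped at the largest finite boundary, so buckets should extend past the latencies you care about.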

5. Custom Metrics from Applications

To monitor your own application, add a Prometheus client library to the application code and expose a /metrics endpoint.

Python (Flask + prometheus_client)

Python

# app.py - Flask with Prometheus metrics
from flask import Flask, request
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
import time

app = Flask(__name__)

# === METRIC DEFINITIONS ===

# Counter: total HTTP requests
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

# Histogram: request duration
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Gauge: active connections
active_connections = Gauge(
    'app_active_connections',
    'Number of active connections'
)

# Gauge: queue size
queue_size = Gauge(
    'app_queue_size',
    'Current queue size',
    ['queue_name']
)

# Info: app version
app_info = Info(
    'app',
    'Application information'
)
app_info.info({
    'version': '1.0.0',
    'language': 'python',
    'framework': 'flask'
})

# === MIDDLEWARE ===
@app.before_request
def before_request():
    request._start_time = time.time()
    active_connections.inc()

@app.after_request
def after_request(response):
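    # Record request duration
    # Note: request.path is used as the label value as-is; URLs that embed IDs
    # would need normalizing first to keep label cardinality bounded.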
    duration = time.time() - request._start_time
    http_request_duration.labels(
        method=request.method,
        endpoint=request.path
    ).observe(duration)

    # Increment request counter
    http_requests_total.labels(
        method=request.method,
        endpoint=request.path,
        status_code=response.status_code
    ).inc()

    active_connections.dec()
    return response

# === METRICS ENDPOINT ===
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# === APPLICATION ROUTES ===
@app.route('/')
def index():
    return {'status': 'ok', 'version': '1.0.0'}

@app.route('/api/users')
def get_users():
    time.sleep(0.1)  # Simulate work
    return {'users': ['user1', 'user2', 'user3']}

@app.route('/api/orders')
def get_orders():
    queue_size.labels(queue_name='orders').set(42)
    return {'orders': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
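
The middleware above labels every request with the raw request.path. For the three fixed routes in this example that is fine, but paths that embed IDs would create one series per ID. A minimal sketch of one way to bound this, assuming numeric and UUID path segments are the only variable parts (normalize_endpoint is a hypothetical helper, not part of prometheus_client):

```python
import re

# Path segments that are entirely digits, e.g. /api/users/42
_NUMERIC_ID = re.compile(r'/\d+(?=/|$)')
# Path segments that look like a UUID
_UUID = re.compile(
    r'/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}'
    r'-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)'
)


def normalize_endpoint(path):
    """Replace variable path segments with placeholders."""
    path = _UUID.sub('/:uuid', path)
    path = _NUMERIC_ID.sub('/:id', path)
    return path


print(normalize_endpoint('/api/users/42'))
print(normalize_endpoint('/api/users/42/orders/7'))
print(normalize_endpoint('/api/users'))
```

In the Flask app this would be used as endpoint=normalize_endpoint(request.path). Another option in Flask is the matched route template, request.url_rule.rule, which is None when no route matched.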

Node.js (Express + prom-client)

JavaScript

// server.js - Express with Prometheus metrics
const express = require('express');
const client = require('prom-client');

const app = express();
const PORT = 8080;

// === METRIC DEFINITIONS ===
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
});

const activeConnections = new client.Gauge({
  name: 'app_active_connections',
  help: 'Number of active connections'
});

const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
});

// Collect default metrics (CPU, memory, event loop lag)
client.collectDefaultMetrics({ prefix: 'nodejs_' });

// === MIDDLEWARE ===
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      route: req.path,
      status: res.statusCode
    });
    httpRequestDuration.observe(
      { method: req.method, route: req.path },
      duration
    );
    activeConnections.dec();
  });
  next();
});

// === METRICS ENDPOINT ===
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

// === APPLICATION ROUTES ===
app.get('/', (req, res) => {
  res.json({ status: 'ok', version: '1.0.0' });
});

app.get('/api/users', (req, res) => {
  // Simulate DB query timing
  const end = dbQueryDuration.startTimer({ query_type: 'SELECT', table: 'users' });
  // ... query database ...
  end();
  res.json({ users: ['user1', 'user2'] });
});

app.listen(PORT, () => {
  console.log(`App running on port ${PORT}`);
});

6. Prometheus Exporters

Exporters convert metrics from systems that do not natively expose Prometheus metrics into a format Prometheus can scrape.

Popular Exporters

Exporter	Purpose	Default Port	Example Metric
node_exporter	System metrics (CPU, RAM, disk, network)	9100	`node_cpu_seconds_total`
mysqld_exporter	MySQL metrics	9104	`mysql_global_status_queries`
redis_exporter	Redis metrics	9121	`redis_connected_clients`
nginx_exporter	NGINX metrics	9113	`nginx_http_requests_total`
blackbox_exporter	Probe endpoints (HTTP, TCP, ICMP)	9115	`probe_http_status_code`
cadvisor	Container metrics (Docker)	8080	`container_cpu_usage_seconds_total`
postgres_exporter	PostgreSQL metrics	9187	`pg_stat_activity_count`
mongodb_exporter	MongoDB metrics	9216	`mongodb_connections`

Installing Node Exporter (Docker)

Bash

# Run node_exporter
docker run -d \
  --name node-exporter \
  --net=host \
  --pid=host \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:v1.8.1 \
  --path.rootfs=/host

# Check the metrics
curl http://localhost:9100/metrics | head -50

# Example output:
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
# node_memory_MemTotal_bytes 8589934592
# node_filesystem_size_bytes{mountpoint="/"} 107374182400

Blackbox Exporter (HTTP Probing)

YAML

# blackbox.yml - blackbox exporter configuration
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_2xx:
    prober: http
    http:
      method: POST
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s

---
# prometheus.yml: scrape config for the blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com
          - https://api.example.com
          - https://blog.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

7. PromQL Advanced

PromQL (Prometheus Query Language) is the query language used to select, aggregate, and analyze time-series data.

Rate & Increase

PromQL

# rate(): per-second average rate of a counter (counters should always go through rate() or increase())
rate(http_requests_total[5m])

# Per-second rate of requests with status code 500
rate(http_requests_total{status_code="500"}[5m])

# increase(): total increase over the window
increase(http_requests_total[1h])

# irate(): instant rate, computed from the last 2 data points in the window
irate(http_requests_total[5m])

# rate vs irate:
# rate()  → smooth, good for alerting
# irate() → responsive, good for dashboards
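
The difference between rate() and irate(), and the way counter resets are absorbed, can be shown with a few lines of arithmetic. This is a simplified sketch: the real functions also extrapolate to the window boundaries, which is omitted here, and the sample values are made up.

```python
def increase_with_resets(samples):
    """samples: list of (timestamp_seconds, counter_value) in time order."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter went backwards: treat it as a restart from 0
            total += cur
    return total


def simple_rate(samples):
    """Average per-second increase over the whole window."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / elapsed


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])


# Steady traffic followed by a burst in the final 15 seconds
window = [(0, 1000), (15, 1030), (30, 1060), (45, 1090), (60, 1390)]
print(simple_rate(window), simple_irate(window))

# The process restarted between t=15 and t=30
with_reset = [(0, 1000), (15, 1030), (30, 20), (45, 50)]
print(increase_with_resets(with_reset))
```

The burst lifts the windowed rate only moderately while the two-sample rate jumps, which is the trade-off summarized in the comments above.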

Aggregation

PromQL

# sum — total requests across all instances
sum(rate(http_requests_total[5m]))

# sum by label — group by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# avg: average CPU utilization (0-1) across all nodes and cores
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# max: highest memory usage
max(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)

# min: lowest available disk space
min(node_filesystem_avail_bytes{mountpoint="/"})

# topk: top 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# bottomk: the 3 instances with the lowest average request duration
bottomk(3,
  sum by (instance) (rate(http_request_duration_seconds_sum[5m]))
  /
  sum by (instance) (rate(http_request_duration_seconds_count[5m]))
)

# count: number of targets that are currently up
count(up == 1)

# stddev: spread of average request duration across instances
stddev(
  sum by (instance) (rate(http_request_duration_seconds_sum[5m]))
  /
  sum by (instance) (rate(http_request_duration_seconds_count[5m]))
)

Advanced Queries

PromQL

# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# Request duration percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))  # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # p99

# Saturation: CPU usage percentage per node
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Network throughput (MB/s)
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024

# Apdex Score (Application Performance Index)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
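
The Apdex query above divides the sum of two buckets by 2, which can look odd at first. It works because buckets are cumulative: the le="0.5" bucket already contains every request counted in le="0.1". A small worked example with made-up counts, assuming 0.1s is the "satisfied" threshold and 0.5s the "tolerating" threshold:

```python
def apdex_from_cumulative(satisfied_le, tolerating_le, total):
    """Apdex from cumulative bucket counts, as in the PromQL query."""
    return (satisfied_le + tolerating_le) / 2 / total


def apdex_textbook(satisfied, tolerating, total):
    """Textbook form: satisfied count plus half of the tolerating count."""
    return (satisfied + tolerating / 2) / total


# Hypothetical traffic: 1000 requests, 800 within 0.1s, 950 within 0.5s
total = 1000
le_01 = 800   # cumulative count for le="0.1"
le_05 = 950   # cumulative count for le="0.5" (includes the 800 above)

score = apdex_from_cumulative(le_01, le_05, total)
same_score = apdex_textbook(le_01, le_05 - le_01, total)
print(score, same_score)
```

Both forms give the same score, so the bucket boundaries chosen in the histogram definition determine which thresholds an Apdex query can use.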

8. Alerting Rules & Alertmanager

Alerting Rules

YAML

# prometheus/rules/alerts.yml
groups:
  - name: application_alerts
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High 5xx error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.instance }}"
          description: "p95 latency is {{ $value }}s (threshold: 2s)"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is DOWN"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

  - name: infrastructure_alerts
    rules:
      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      # Low Memory
      - alert: LowMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # Disk Almost Full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is {{ $value | printf \"%.1f\" }}%"

      # Disk will be full in 4 hours
      - alert: DiskWillFillSoon
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"

      # Container Restart Loop
      - alert: ContainerRestartLoop
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} restarting frequently"
          description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
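
Every rule above has a for: clause, which delays firing until the expression has been true continuously for that long. The sketch below models that behavior for the ServiceDown rule (for: 1m) with the 15s evaluation interval from prometheus.yml. It is a simplified state machine for illustration, not Prometheus code, and the sequence of evaluation results is made up.

```python
def alert_states(condition_per_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None
    for i, is_true in enumerate(condition_per_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any evaluation where the expression is false resets the timer
            active_since = None
            states.append('inactive')
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append('firing' if held_for >= for_s else 'pending')
    return states


# up == 0 is true twice, recovers once, then stays true
evaluations = [True, True, False, True, True, True, True, True]
states = alert_states(evaluations, eval_interval_s=15, for_s=60)
print(states)
```

The brief recovery resets the timer, so the alert only fires after a further full minute of continuous failure. That is the purpose of for: it filters out short blips at the cost of slower detection.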

Alertmanager Configuration

YAML

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-password'

# Routing rules
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

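  routes:
    # Critical alerts → PagerDuty + Slack
    # Sibling routes are checked in order and the first match wins
    # (add `continue: true` to keep evaluating later routes).
    - matchers:
        - severity="critical"
      receiver: 'critical-pagerduty'
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts → Slack only
    - matchers:
        - severity="warning"
      receiver: 'warning-slack'
      repeat_interval: 4h

    # Team-specific routing (only reached by alerts that matched neither route above)
    - matchers:
        - team="backend"
      receiver: 'backend-slack'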

# Receivers
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: 'critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'

  - name: 'warning-slack'
    slack_configs:
      - channel: '#alerts-warning'
        color: 'warning'

  - name: 'backend-slack'
    slack_configs:
      - channel: '#backend-alerts'

# Inhibition rules
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
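
Route order in the configuration above matters, because an alert is delivered to the first sibling route that matches it. The sketch below models only that first-match behavior with simple label equality; it ignores continue: true, nested routes, and regex matchers, so treat it as an illustration rather than Alertmanager's actual algorithm.

```python
# (labels that must match, receiver), in the same order as the config above
ROUTES = [
    ({'severity': 'critical'}, 'critical-pagerduty'),
    ({'severity': 'warning'}, 'warning-slack'),
    ({'team': 'backend'}, 'backend-slack'),
]
DEFAULT_RECEIVER = 'default-slack'


def pick_receiver(alert_labels):
    """Return the receiver of the first route whose labels all match."""
    for required, receiver in ROUTES:
        if all(alert_labels.get(k) == v for k, v in required.items()):
            return receiver
    return DEFAULT_RECEIVER


high_error_rate = {'alertname': 'HighErrorRate', 'severity': 'critical', 'team': 'backend'}
print(pick_receiver(high_error_rate))
print(pick_receiver({'alertname': 'SlowJob', 'team': 'backend'}))
print(pick_receiver({'alertname': 'Heartbeat', 'severity': 'info'}))
```

HighErrorRate carries both severity=critical and team=backend, yet it never reaches backend-slack because the severity route comes first. If the backend channel should also receive it, the severity routes need continue: true or the team route has to move up.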

9. Building a Grafana Dashboard

Provisioning Datasource

YAML

# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus

Dashboard JSON (Node Overview)

JSON

{
  "dashboard": {
    "title": "Node Overview",
    "tags": ["infrastructure", "node"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 70, "color": "yellow"},
                {"value": 85, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 80, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Disk Usage",
        "type": "bargauge",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "(1 - node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "Bps"}
        }
      }
    ],
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"}
  }
}

10. Grafana Advanced Features

Variables (Template)

Variables make a dashboard dynamic and can be used as filters:

PromQL (Variable Query)

# Variable: $instance
# Query: label_values(up, instance)
# Result: web1:9100, web2:9100, db1:9100

# Variable: $job
# Query: label_values(up, job)
# Result: node-exporter, prometheus, mysql

# Variable: $interval
# Type: Interval
# Values: 1m, 5m, 15m, 1h, 6h, 1d

# Using the variables in queries:
rate(http_requests_total{instance="$instance"}[$interval])
avg by (job) (rate(http_requests_total[$interval]))
node_memory_MemAvailable_bytes{instance=~"$instance"}
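
The queries above use both instance="$instance" and instance=~"$instance". The difference shows up with multi-value variables: for Prometheus data sources Grafana expands a multi-value selection into a regex alternation, which only works with the =~ operator. The sketch below imitates that expansion; the exact escaping Grafana applies may differ, and interpolate_multi_value is a hypothetical helper.

```python
import re


def interpolate_multi_value(query, name, values):
    """Expand $name into a regex alternation such as (a|b)."""
    alternation = '(' + '|'.join(re.escape(v) for v in values) + ')'
    return query.replace('$' + name, alternation)


selected = ['web1:9100', 'web2:9100']

regex_query = interpolate_multi_value('up{instance=~"$instance"}', 'instance', selected)
exact_query = interpolate_multi_value('up{instance="$instance"}', 'instance', selected)
print(regex_query)
print(exact_query)
```

The second query compares the label to the literal string (web1:9100|web2:9100) and matches nothing, so =~ is the safer default for any variable that allows multiple values or an "All" option.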

Annotations

Bash

# Grafana annotations from Prometheus alerts
# Appear automatically on the dashboard when an alert fires

# Custom annotations from a data source
# Query: ALERTS{alertstate="firing"}

# Manual annotations via the API
curl -X POST http://admin:admin123@localhost:3000/api/annotations \
  -H "Content-Type: application/json" \
  -d '{
    "dashboardUID": "abc123",
    "time": 1625097600000,
    "text": "Deployed v2.0.0",
    "tags": ["deploy", "v2.0.0"]
  }'
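
The same annotation call can be made from a deployment script. The sketch below builds the request with the Python standard library and mirrors the curl example: the URL, credentials, and dashboard UID are the placeholder values from above, and build_annotation_request is a hypothetical helper. The request is only constructed here, not sent.

```python
import base64
import json
import time
import urllib.request


def build_annotation_request(base_url, user, password, dashboard_uid, text, tags):
    """Build a POST request for Grafana's /api/annotations endpoint."""
    payload = {
        'dashboardUID': dashboard_uid,
        'time': int(time.time() * 1000),  # epoch milliseconds
        'text': text,
        'tags': tags,
    }
    token = base64.b64encode(f'{user}:{password}'.encode()).decode()
    return urllib.request.Request(
        base_url + '/api/annotations',
        data=json.dumps(payload).encode(),
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Basic ' + token,
        },
        method='POST',
    )


req = build_annotation_request(
    'http://localhost:3000', 'admin', 'admin123',
    'abc123', 'Deployed v2.0.0', ['deploy', 'v2.0.0'],
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Calling this at the end of a deploy pipeline marks each release on the dashboard, which makes it easier to correlate a latency or error spike with a specific version.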

Popular Panel Types

Panel Type	Best For	Example Use
Time Series	Data that changes over time	CPU, memory, request rate
Gauge	Current value against a threshold	Disk usage, SLA compliance
Stat	Single number	Total requests, uptime
Bar Gauge	Comparison across items	CPU per instance
Table	Tabular data	Top 10 slowest endpoints
Heatmap	Density distribution	Request latency distribution
Pie Chart	Percentage breakdown	Traffic by status code
Logs	Log exploration	Application logs from Loki

Grafana Alerting (Unified Alerting)

YAML

# Grafana provisioning alert rules
# grafana/provisioning/alerting/rules.yaml
apiVersion: 1

groups:
  - orgId: 1
    name: Application Alerts
    folder: Monitoring
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: 'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'
              instant: true
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate detected"

11. Best Practices

Four Golden Signals (Google SRE)

Signal	Description	Prometheus Metric
Latency	Time taken to serve a request	`histogram_quantile(0.95, ...)`
Traffic	How many requests are being served	`rate(http_requests_total[5m])`
Errors	Rate of failed requests	`rate(http_requests_total{status_code=~"5.."}[5m])`
Saturation	How full the resources are	`CPU%, memory%, disk%`

Monitoring Checklist

Monitor all 4 golden signals: latency, traffic, errors, saturation
Use rate() on counters; do not query raw counter values
Set retention to match your needs (the default is 15 days; this guide uses 30 days for production)
Use recording rules for frequently used queries
Tiered alerting: warning (Slack) → critical (PagerDuty + Slack)
Test alert rules with promtool
Use dashboard variables for filtering
Store dashboards as code (JSON in Git)
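
Retention is closely tied to disk sizing. A common rule of thumb is disk ≈ retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that formula; the series count is a made-up assumption and the result is a rough estimate, not a guarantee.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB disk estimate from series count, interval, and retention."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 100,000 active series, 15s scrape interval, 30d retention
estimate = estimate_disk_bytes(100_000, 15, 30)
print(round(estimate / 1e9, 2), 'GB')
```

Under these assumptions the estimate is about 35 GB, which is in the same range as the 50Gi volume requested in the Helm example. Doubling the series count or halving the scrape interval doubles the requirement.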

⚠️ Warning

Avoid high-cardinality labels (such as user_id or request_id): every distinct combination of label values creates a new time series, and Prometheus can run out of memory. Use labels with a bounded set of values (endpoint, method, status_code).
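
The effect is multiplicative, which a quick calculation makes visible. The sketch below computes the worst-case series count for one metric as the product of its label cardinalities. All the cardinality numbers are illustrative assumptions.

```python
from math import prod


def max_series(label_cardinalities):
    """Worst-case number of series: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = max_series({'method': 5, 'endpoint': 20, 'status_code': 10})
with_user_id = max_series({
    'method': 5, 'endpoint': 20, 'status_code': 10, 'user_id': 100_000,
})

print(bounded)
print(with_user_id)
```

One extra unbounded label turns a thousand possible series into a hundred million. For histograms the count is multiplied again by the number of buckets, so information such as user IDs belongs in logs or traces rather than in metric labels.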
