Prometheus & Grafana Advanced

An advanced tutorial on monitoring with Prometheus & Grafana: custom metrics, advanced PromQL, alerting rules, dashboard creation, exporters, and production best practices

1. Introduction to Prometheus & Grafana

Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) project that is widely used in the Kubernetes and cloud-native ecosystem.

Grafana is a visualization and analytics platform for building interactive dashboards from many data sources, including Prometheus. Together they are a de facto standard for observability in cloud-native environments.

Why Does Monitoring Matter?

| Aspect | Without Monitoring | With Monitoring |
|---|---|---|
| Problem detection | 🔴 Users report it first | 🟢 Automatic alerts before users notice |
| Root cause analysis | 🔴 Guesswork | 🟢 Data-driven: you know exactly when and where |
| Capacity planning | 🔴 Over-provisioning or under-provisioning | 🟢 Data-driven decisions |
| SLA/SLO | 🔴 Cannot be measured | 🟢 Precise measurement & tracking |
| Performance tuning | 🔴 Bottlenecks unknown | 🟢 Profiling & targeted optimization |

Pull vs Push Model

Prometheus uses a pull model: it actively fetches (scrapes) metrics from its targets. This differs from push-based systems such as Graphite or StatsD:

Diagram: Pull vs Push Monitoring
  PUSH MODEL                          PULL MODEL (Prometheus)
  ┌──────────┐                        ┌──────────────┐
  │ App A    │──push──┐               │  Prometheus  │──GET /metrics──► App A
  ├──────────┤        │   ┌───────┐   │  Server      │
  │ App B    │──push──┼──►│Central│   │              │──GET /metrics──► App B
  ├──────────┤        │   │Server │   │  Scrape      │
  │ App C    │──push──┘   └───────┘   │  (pull)      │──GET /metrics──► App C
  └──────────┘                        └──────────────┘

  Each app sends data to              The server fetches data from the apps
  the central server                  → Controlled scrape rate
  → Can overload the server           → Failed scrapes are visible (up == 0)
  → Data can be lost
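
The pull model can be illustrated without any Prometheus component. The sketch below uses only the Python standard library: a target exposes its current values at /metrics in the text exposition format, and a scraper fetches and parses them. The metric name `demo_requests_total`, the value 7, and the helper names are assumptions made for this illustration, and the parser only handles unlabeled samples.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process value standing in for real application state.
REQUESTS_SERVED = 7


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: exposes current values in the text exposition format."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP demo_requests_total Requests served by the demo app.\n"
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_SERVED}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url, timeout=2.0):
    """Server side: pull one set of samples, skipping comment lines."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


def run_demo():
    """Start a target on a free local port, scrape it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The target never needs to know where the monitoring server lives. It only answers requests, which is why the scrape rate stays under the server's control.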

2. Monitoring Stack Architecture

Diagram: Full Monitoring Stack Architecture
┌──────────────────────────────────────────────────────────────┐
│                       MONITORING STACK                       │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                      DATA SOURCES                      │  │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐        │  │
│  │  │ App A  │  │ App B  │  │ Node   │  │ MySQL  │        │  │
│  │  │/metrics│  │/metrics│  │Exporter│  │Exporter│        │  │
│  │  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘        │  │
│  └──────┼───────────┼───────────┼───────────┼─────────────┘  │
│         │           │           │           │                │
│         ▼           ▼           ▼           ▼                │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   PROMETHEUS SERVER                    │  │
│  │  ┌──────────────┐  ┌────────────┐  ┌─────────────┐     │  │
│  │  │ Rules Engine │  │  HTTP API  │  │  TSDB       │     │  │
│  │  │  (Alerting)  │  │  (Query)   │  │  (Storage)  │     │  │
│  │  └──────┬───────┘  └──────┬─────┘  └─────────────┘     │  │
│  └─────────┼─────────────────┼────────────────────────────┘  │
│            │                 │                               │
│            ▼                 ▼                               │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │
│  │ Alertmanager │   │   Grafana    │   │   PromQL     │      │
│  │              │   │  (Dashboard) │   │   (Query)    │      │
│  │→ Slack       │   │              │   │              │      │
│  │→ Email       │   │  Panels:     │   │  rate()      │      │
│  │→ PagerDuty   │   │  - Graphs    │   │  histogram   │      │
│  │→ Webhook     │   │  - Tables    │   │  quantile    │      │
│  └──────────────┘   │  - Gauges    │   └──────────────┘      │
│                     └──────────────┘                         │
└──────────────────────────────────────────────────────────────┘

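Main components of the monitoring stack: the data sources (applications exposing /metrics, plus exporters such as Node Exporter and MySQL Exporter), the Prometheus server (TSDB storage, rules engine, HTTP query API), Alertmanager (notifications to Slack, email, PagerDuty, and webhooks), and Grafana (dashboards).

3. Installing the Full Stack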

Docker Compose (Development)

YAML
# docker-compose-monitoring.yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

YAML
# prometheus/prometheus.yml
global:
  scrape_interval: 15s         # Default scrape interval
  evaluation_interval: 15s     # Rules evaluation interval
  scrape_timeout: 10s          # Scrape timeout

# Alert Rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager Configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape Configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          environment: 'production'

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 'cadvisor:8080'

  # Application metrics
  - job_name: 'my-app'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'app:8080'
        labels:
          app: 'myapp'
          environment: 'production'

  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets:
          - 'mysql-exporter:9104'

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis-exporter:9121'

  # NGINX Exporter
  - job_name: 'nginx'
    static_configs:
      - targets:
          - 'nginx-exporter:9113'

Kubernetes with Helm

Bash
# Add the repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + exporters)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

# Verify
kubectl get pods -n monitoring

# Access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Open http://localhost:3000 (admin/admin123)

# Access Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090

4. Metric Types in Prometheus

Prometheus supports four basic metric types. Understanding how they differ is essential for using PromQL correctly.

| Type | Description | Example | PromQL |
|---|---|---|---|
| Counter | Value that only increases (or resets to 0 on restart) | HTTP requests total, bytes sent | rate() |
| Gauge | Value that can go up and down | Temperature, memory usage, queue size | avg(), max() |
| Histogram | Counts observations into buckets (value distribution) | Request duration, response size | histogram_quantile() |
| Summary | Like a histogram, but quantiles are computed on the client | Request duration quantiles | Query the quantiles directly |

PromQL
# Counter: always use rate() or increase()
# HTTP requests per second, averaged over the last 5 minutes
rate(http_requests_total[5m])

# Total requests in the last hour
increase(http_requests_total[1h])

# Gauge: query directly
# Available memory as a percentage of total memory
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# CPU load average (1 minute)
node_load1

# Histogram: use histogram_quantile()
# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
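
To see why histogram percentiles are estimates rather than exact values, it helps to reproduce the calculation by hand. The following is a simplified Python sketch of the linear interpolation that histogram_quantile() performs over cumulative buckets. The bucket counts in the example are made up, and details such as native histograms and negative bucket bounds are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # The quantile lies above the highest finite bound; report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return lower
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical cumulative counts for le = 0.1, 0.5, 1.0, +Inf
    demo = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, demo))
```

The result can never be more precise than the bucket it falls into, so bucket boundaries should be chosen around the latencies that matter (for example the SLO threshold).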

5. Custom Metrics from the Application

To monitor your own application, add a Prometheus client library to the application code and expose a /metrics endpoint.

Python (Flask + prometheus_client)

Python
# app.py - Flask with Prometheus metrics
from flask import Flask, request
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
import time

app = Flask(__name__)

# === METRIC DEFINITIONS ===

# Counter: total HTTP requests
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

# Histogram: request duration
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Gauge: active connections
active_connections = Gauge(
    'app_active_connections',
    'Number of active connections'
)

# Gauge: queue size
queue_size = Gauge(
    'app_queue_size',
    'Current queue size',
    ['queue_name']
)

# Info: app version
app_info = Info(
    'app',
    'Application information'
)
app_info.info({
    'version': '1.0.0',
    'language': 'python',
    'framework': 'flask'
})

# === MIDDLEWARE ===
@app.before_request
def before_request():
    request._start_time = time.time()
    active_connections.inc()

@app.after_request
def after_request(response):
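    # Record request duration
    duration = time.time() - request._start_time
    # Use the route template (e.g. /api/users) rather than the raw path,
    # so unknown URLs cannot create unbounded label values
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    http_request_duration.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)

    # Increment request counter
    http_requests_total.labels(
        method=request.method,
        endpoint=endpoint,
        status_code=response.status_code
    ).inc()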

    active_connections.dec()
    return response

# === METRICS ENDPOINT ===
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# === APPLICATION ROUTES ===
@app.route('/')
def index():
    return {'status': 'ok', 'version': '1.0.0'}

@app.route('/api/users')
def get_users():
    time.sleep(0.1)  # Simulate work
    return {'users': ['user1', 'user2', 'user3']}

@app.route('/api/orders')
def get_orders():
    queue_size.labels(queue_name='orders').set(42)
    return {'orders': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Node.js (Express + prom-client)

JavaScript
// server.js - Express with Prometheus metrics
const express = require('express');
const client = require('prom-client');

const app = express();
const PORT = 8080;

// === METRIC DEFINITIONS ===
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
});

const activeConnections = new client.Gauge({
  name: 'app_active_connections',
  help: 'Number of active connections'
});

const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
});

// Collect default metrics (CPU, memory, event loop lag).
// No prefix is set: the built-in names already start with process_ or nodejs_
client.collectDefaultMetrics();

// === MIDDLEWARE ===
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      route: req.path,
      status: res.statusCode
    });
    httpRequestDuration.observe(
      { method: req.method, route: req.path },
      duration
    );
    activeConnections.dec();
  });
  next();
});

// === METRICS ENDPOINT ===
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

// === APPLICATION ROUTES ===
app.get('/', (req, res) => {
  res.json({ status: 'ok', version: '1.0.0' });
});

app.get('/api/users', (req, res) => {
  // Simulate DB query timing
  const end = dbQueryDuration.startTimer({ query_type: 'SELECT', table: 'users' });
  // ... query database ...
  end();
  res.json({ users: ['user1', 'user2'] });
});

app.listen(PORT, () => {
  console.log(`App running on port ${PORT}`);
});

6. Prometheus Exporters

Exporters convert metrics from systems that do not natively expose Prometheus metrics into a format that Prometheus can scrape.

Popular Exporters

| Exporter | Purpose | Default Port | Example Metric |
|---|---|---|---|
| node_exporter | System metrics (CPU, RAM, disk, network) | 9100 | node_cpu_seconds_total |
| mysqld_exporter | MySQL metrics | 9104 | mysql_global_status_queries |
| redis_exporter | Redis metrics | 9121 | redis_connected_clients |
| nginx_exporter | NGINX metrics | 9113 | nginx_http_requests_total |
| blackbox_exporter | Probe endpoints (HTTP, TCP, ICMP) | 9115 | probe_http_status_code |
| cadvisor | Container metrics (Docker) | 8080 | container_cpu_usage_seconds_total |
| postgres_exporter | PostgreSQL metrics | 9187 | pg_stat_activity_count |
| mongodb_exporter | MongoDB metrics | 9216 | mongodb_connections |

Installing Node Exporter (Docker)

Bash
# Run node_exporter
docker run -d \
  --name node-exporter \
  --net=host \
  --pid=host \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:v1.8.1 \
  --path.rootfs=/host

# Check the metrics
curl http://localhost:9100/metrics | head -50

# Example output:
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
# node_memory_MemTotal_bytes 8589934592
# node_filesystem_size_bytes{mountpoint="/"} 107374182400

Blackbox Exporter (HTTP Probing)

YAML
# blackbox.yml - blackbox exporter configuration
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_2xx:
    prober: http
    http:
      method: POST
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s

---
# prometheus.yml: scrape config for blackbox
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com
          - https://api.example.com
          - https://blog.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
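
The three relabel steps above are easier to follow when traced for a single target. The Python sketch below imitates only the default `replace` action (default regex and separator) and then builds the URL that would be requested. The function names are hypothetical, and the starting label set is an assumption about what one static target looks like before relabeling.

```python
from urllib.parse import urlencode


def relabel_replace(labels, source_labels=(), target_label="", replacement=None):
    """Simplified `replace` action using the default regex (.*)."""
    result = dict(labels)
    joined = ";".join(labels.get(name, "") for name in source_labels)
    # With the default replacement "$1", the joined source value is copied as is.
    result[target_label] = joined if replacement is None else replacement
    return result


def blackbox_probe_url(target, exporter="blackbox-exporter:9115", module="http_2xx"):
    """Trace one target through the three relabel rules and build the scrape URL."""
    labels = {
        "__address__": target,          # the URL listed under static_configs
        "__metrics_path__": "/probe",   # from metrics_path
        "__param_module": module,       # from params.module
    }
    # 1. Copy the target URL into the ?target= query parameter.
    labels = relabel_replace(labels, ["__address__"], "__param_target")
    # 2. Keep the probed URL as the human-readable instance label.
    labels = relabel_replace(labels, ["__param_target"], "instance")
    # 3. Point the actual scrape at the blackbox exporter.
    labels = relabel_replace(labels, [], "__address__", replacement=exporter)

    query = urlencode({
        "module": labels["__param_module"],
        "target": labels["__param_target"],
    })
    url = f"http://{labels['__address__']}{labels['__metrics_path__']}?{query}"
    return labels["instance"], url


if __name__ == "__main__":
    print(blackbox_probe_url("https://app.example.com"))
```

Prometheus therefore scrapes the exporter, not the website. The website URL only travels as a query parameter and as the `instance` label on the resulting `probe_*` series.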

7. PromQL Advanced

PromQL (Prometheus Query Language) is the query language for selecting, aggregating, and analyzing time-series data.

Rate & Increase

PromQL
# rate(): per-second average rate of a counter (always wrap counters in rate() or increase())
rate(http_requests_total[5m])

# Per-second rate of HTTP 500 responses
rate(http_requests_total{status_code="500"}[5m])

# increase(): total increase over the period
increase(http_requests_total[1h])

# irate(): instant rate (uses the last 2 data points in the range)
irate(http_requests_total[5m])

# rate vs irate:
# rate()  → smooth, good for alerting
# irate() → responsive, good for dashboards
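
The reason counters need rate() rather than a plain difference is the reset to 0 on restart. The Python sketch below is a deliberately simplified model of that reset handling. It is not the Prometheus implementation (which also extrapolates to the edges of the range window), and the sample values are invented.

```python
def counter_increase(samples):
    """Total increase over (timestamp, value) samples, tolerating counter resets."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0.
        increase += cur if cur < prev else cur - prev
    return increase


def simple_rate(samples):
    """Per-second rate between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None  # two samples are the minimum for a rate
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


def simple_irate(samples):
    """Per-second rate computed from only the last two samples."""
    return simple_rate(samples[-2:])


if __name__ == "__main__":
    # Hypothetical scrapes every 15s; the app restarts between t=30 and t=45.
    scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(counter_increase(scrapes), simple_rate(scrapes), simple_irate(scrapes))
```

A naive `last - first` on these samples would give -60. Treating the drop as a restart recovers an increase of 100 over the minute.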

Aggregation

PromQL
# sum: total requests across all instances
sum(rate(http_requests_total[5m]))

# sum by label: group by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# avg: average idle CPU fraction across all cores and nodes
avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

# max: highest memory usage
max(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)

# min: lowest free disk space
min(node_filesystem_avail_bytes{mountpoint="/"})

# topk: top 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# bottomk: 3 instances with the lowest request rate
bottomk(3, sum by (instance) (rate(http_requests_total[5m])))

# count: number of targets that are up
count(up == 1)

# stddev: spread of the average request duration across series
stddev(
  rate(http_request_duration_seconds_sum[5m])
  /
  rate(http_request_duration_seconds_count[5m])
)

Advanced Queries

PromQL
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# Request duration percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))  # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))  # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))  # p99

# Saturation: CPU usage percentage per node
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Network throughput (MiB/s)
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024

# Apdex Score (Application Performance Index)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

# Predict disk full in 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
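
The disk-full prediction is a least-squares line fitted through the samples in the range and extended into the future. Below is a minimal Python sketch of that idea, under the assumption that a plain linear regression is a fair stand-in for what predict_linear() computes. The sample values are invented.

```python
def predict_linear(samples, eval_time, horizon_seconds):
    """Fit a least-squares line through (timestamp, value) samples and
    return its value `horizon_seconds` after `eval_time`."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return None  # all samples share one timestamp
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return intercept + slope * (eval_time + horizon_seconds)


if __name__ == "__main__":
    # Hypothetical free space in GiB, sampled every 10 minutes.
    free_gib = [(0, 100.0), (600, 90.0), (1200, 80.0), (1800, 70.0)]
    predicted = predict_linear(free_gib, eval_time=1800, horizon_seconds=4 * 3600)
    print(predicted, "-> alert condition:", predicted < 0)
```

Because the fit uses every sample in the window, a longer range (such as the 6h used in the alert rule later in this tutorial) reacts more slowly but is less likely to fire on a short burst of writes.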

8. Alerting Rules & Alertmanager

Alerting Rules

YAML
# prometheus/rules/alerts.yml
groups:
  - name: application_alerts
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High 5xx error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.instance }}"
          description: "p95 latency is {{ $value }}s (threshold: 2s)"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is DOWN"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

  - name: infrastructure_alerts
    rules:
      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      # Low Memory
      - alert: LowMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # Disk Almost Full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is {{ $value | printf \"%.1f\" }}%"

      # Disk will be full in 4 hours
      - alert: DiskWillFillSoon
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"

      # Container Restart Loop
      - alert: ContainerRestartLoop
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} restarting frequently"
          description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
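
The `for:` field in these rules means the expression has to stay true across consecutive evaluations before the alert fires. Until then the alert is pending. The sketch below models that state machine in Python. The function name and the 15-second evaluation interval are assumptions for the illustration.

```python
def alert_states(condition_by_eval, for_seconds, eval_interval=15):
    """Return the alert state after each evaluation: inactive, pending or firing."""
    states = []
    active_since = None
    for i, condition_true in enumerate(condition_by_eval):
        now = i * eval_interval
        if not condition_true:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Condition becomes true at the 2nd evaluation and briefly recovers later.
    history = [False, True, True, True, True, True, False, True]
    for state in alert_states(history, for_seconds=60):
        print(state)
```

A longer `for:` filters out short spikes at the cost of a later notification, so critical rules such as ServiceDown usually use a short value.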

Alertmanager Configuration

YAML
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-password'

# Routing rules
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts → PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'critical-pagerduty'
      group_wait: 10s
      repeat_interval: 1h
      continue: true   # keep evaluating so the team route below can also match

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'warning-slack'
      repeat_interval: 4h
      continue: true

    # Team-specific routing
    - match:
        team: backend
      receiver: 'backend-slack'

# Receivers
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: 'critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'

  - name: 'warning-slack'
    slack_configs:
      - channel: '#alerts-warning'
        color: 'warning'

  - name: 'backend-slack'
    slack_configs:
      - channel: '#backend-alerts'

# Inhibition rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
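
Route order matters: Alertmanager checks child routes from top to bottom and stops at the first match unless that route sets `continue: true`. The Python sketch below models this for a single level of child routes. The `ROUTE` dictionary is a hypothetical, flattened stand-in for a routing tree, and the function name is an assumption.

```python
# Hypothetical single-level routing tree with no `continue` flags set.
ROUTE = {
    "receiver": "default-slack",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "critical-pagerduty"},
        {"match": {"severity": "warning"}, "receiver": "warning-slack"},
        {"match": {"team": "backend"}, "receiver": "backend-slack"},
    ],
}


def matching_receivers(alert_labels, route):
    """Receivers that would be notified for one alert (single-level tree)."""
    receivers = []
    for child in route.get("routes", []):
        if all(alert_labels.get(k) == v for k, v in child["match"].items()):
            receivers.append(child["receiver"])
            if not child.get("continue", False):
                break  # first match wins unless `continue: true`
    # No child matched: the parent route's receiver handles the alert.
    return receivers or [route["receiver"]]


if __name__ == "__main__":
    alert = {"alertname": "HighErrorRate", "severity": "critical", "team": "backend"}
    print(matching_receivers(alert, ROUTE))
```

In this model an alert labeled both `severity: critical` and `team: backend` reaches the team channel only if the severity route carries `continue: true` or the team route is listed first.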

9. Creating a Grafana Dashboard

Provisioning Datasource

YAML
# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus        # stable UID so provisioned alert rules can reference it
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus

Dashboard JSON (Node Overview)

JSON
{
  "dashboard": {
    "title": "Node Overview",
    "tags": ["infrastructure", "node"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 70, "color": "yellow"},
                {"value": 85, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 80, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Disk Usage",
        "type": "bargauge",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "(1 - node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "Bps"}
        }
      }
    ],
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"}
  }
}

10. Grafana Advanced Features

Variables (Template)

Variables make a dashboard dynamic and can be used as filters:

PromQL (Variable Query)
# Variable: $instance
# Query: label_values(up, instance)
# Result: web1:9100, web2:9100, db1:9100

# Variable: $job
# Query: label_values(up, job)
# Result: node-exporter, prometheus, mysql

# Variable: $interval
# Type: Interval
# Values: 1m, 5m, 15m, 1h, 6h, 1d

# Using variables in queries:
rate(http_requests_total{instance="$instance"}[$interval])
avg by (job) (rate(http_requests_total[$interval]))
node_memory_MemAvailable_bytes{instance=~"$instance"}

Annotations

Bash
# Grafana annotations from Prometheus alerts
# They appear on the dashboard automatically when an alert fires

# Custom annotations from a data source
# Query: ALERTS{alertstate="firing"}

# Manual annotations via the API
curl -X POST http://admin:admin123@localhost:3000/api/annotations \
  -H "Content-Type: application/json" \
  -d '{
    "dashboardUID": "abc123",
    "time": 1625097600000,
    "text": "Deployed v2.0.0",
    "tags": ["deploy", "v2.0.0"]
  }'

Popular Panel Types

| Panel Type | Best For | Example Use |
|---|---|---|
| Time Series | Data that changes over time | CPU, memory, request rate |
| Gauge | Current value against a threshold | Disk usage, SLA compliance |
| Stat | A single number | Total requests, uptime |
| Bar Gauge | Comparison between items | CPU per instance |
| Table | Tabular data | Top 10 slowest endpoints |
| Heatmap | Density distribution | Request latency distribution |
| Pie Chart | Percentage distribution | Traffic by status code |
| Logs | Log exploration | Application logs from Loki |

Grafana Alerting (Unified Alerting)

YAML
# Grafana provisioning alert rules
# grafana/provisioning/alerting/rules.yaml
apiVersion: 1

groups:
  - orgId: 1
    name: Application Alerts
    folder: Monitoring
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: 'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'
              instant: true
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate detected"

11. Best Practices

Four Golden Signals (Google SRE)

| Signal | Description | Prometheus Metric |
|---|---|---|
| Latency | Time taken to respond to a request | histogram_quantile(0.95, ...) |
| Traffic | How many requests are being served | rate(http_requests_total[5m]) |
| Errors | Rate of failed requests | rate(http_requests_total{status_code=~"5.."}[5m]) |
| Saturation | How full the resources are | CPU%, memory%, disk% |

Monitoring Checklist

⚠️ Peringatan

Avoid high-cardinality labels (such as user_id or request_id): every distinct combination of label values creates a new time series, so Prometheus can run out of memory. Use labels with a bounded set of values (endpoint, method, status_code).
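
The effect of one extra label is multiplicative, which a quick calculation makes concrete. In the Python sketch below the label counts are hypothetical, and the result is an upper bound because not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities, series_per_label_set=1):
    """Upper bound on time series for one metric.

    label_cardinalities maps label name -> number of distinct values.
    series_per_label_set is 1 for a counter or gauge; a histogram adds one
    series per bucket plus _sum and _count.
    """
    return prod(label_cardinalities.values()) * series_per_label_set


if __name__ == "__main__":
    bounded = {"method": 5, "endpoint": 20, "status_code": 10}
    print(series_count(bounded))                          # bounded labels only
    print(series_count({**bounded, "user_id": 10_000}))   # with a per-user label
```

With these made-up numbers the bounded labels give at most 1,000 series, while adding a `user_id` label with 10,000 values raises the bound to 10,000,000 for a single metric.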

12. Comprehension Quiz

1. Which model does Prometheus use to collect metrics?

2. Which PromQL function should be used with Counter metrics?

3. What does Alertmanager do?

4. What are the "Four Golden Signals" in monitoring?

5. Why are high-cardinality labels dangerous for Prometheus?
