1. Introduction to Prometheus & Grafana
Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud. It is now a CNCF (Cloud Native Computing Foundation) project and is widely used across the Kubernetes and cloud-native ecosystem.
Grafana is a visualization and analytics platform for building interactive dashboards from many data sources, including Prometheus. Together they form a widely adopted observability stack for cloud-native systems.
Why Does Monitoring Matter?
| Aspect | Without Monitoring | With Monitoring |
|---|---|---|
| Problem detection | 🔴 Users report it first | 🟢 Automatic alerts before users notice |
| Root cause analysis | 🔴 Guesswork | 🟢 Data-driven: you know when and where it happened |
| Capacity planning | 🔴 Over-provisioning or under-provisioning | 🟢 Data-driven decisions |
| SLA/SLO | 🔴 Cannot be measured | 🟢 Precise measurement and tracking |
| Performance tuning | 🔴 Bottlenecks unknown | 🟢 Profiling and targeted optimization |
Pull vs Push Model
Prometheus uses a pull model: it actively fetches (scrapes) metrics from its targets. This differs from push-based systems such as Graphite or StatsD:
PUSH MODEL                             PULL MODEL (Prometheus)
App A --push--+                        Prometheus Server
App B --push--+--> Central Server        |-- scrape /metrics --> App A
App C --push--+                          |-- scrape /metrics --> App B
                                         |-- scrape /metrics --> App C
Each app sends data to the             The server fetches data from the apps
central server
✗ Can overload the server              ✓ Controlled rate
✗ Data can be lost                     ✓ A failed scrape is visible (up == 0) and is retried at the next interval
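To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not how a real application should expose metrics (a client library such as prometheus_client does that). The counter values, the `serve` helper and port 8000 are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical in-process counters; a real app would use a client library.
REQUEST_COUNTS = {("GET", "/"): 3, ("GET", "/api/users"): 5}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests",
        "# TYPE http_requests_total counter",
    ]
    for (method, endpoint), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",endpoint="{endpoint}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Target side: it never sends data, it only answers GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the target; blocks until interrupted (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


def scrape(url, timeout=10.0):
    """Server side of the pull model: fetch the target's current state."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `serve()` in one process and `scrape("http://localhost:8000/metrics")` in another mimics a single scrape. The scraper decides when and how often to ask, which is where the controlled rate comes from.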
2. Monitoring Stack Architecture
MONITORING STACK

DATA SOURCES: App A (/metrics), App B (/metrics), Node Exporter, MySQL Exporter
      |  scraped by
      v
PROMETHEUS SERVER: TSDB (storage), Rules Engine (alerting), HTTP API (query)
      |
      +--> Alertmanager (fed by the Rules Engine): Slack, Email, PagerDuty, Webhook
      +--> Grafana (via the HTTP API): dashboards with panels (graphs, tables, gauges)
      +--> PromQL (via the HTTP API): queries such as rate() and histogram_quantile()

Main components of the monitoring stack:
- Prometheus Server: collects and stores time-series data
- Exporters: convert metrics from other systems into the Prometheus format
- Alertmanager: manages and sends alert notifications
- Grafana: visualization and interactive dashboards
- Pushgateway: for short-lived jobs that cannot be scraped
3. Installing the Full Stack
Docker Compose (Development)
# docker-compose-monitoring.yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      # Development-only credentials; change them before exposing Grafana
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_INSTALL_PLUGINS=grafana-clock-panel
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.8.1
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus/prometheus.yml
global:
  scrape_interval: 15s       # Default scrape interval
  evaluation_interval: 15s   # Rules evaluation interval
  scrape_timeout: 10s        # Scrape timeout

# Alert Rules
rule_files:
  - /etc/prometheus/rules/*.yml

# Alertmanager Configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape Configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (system metrics)
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          environment: 'production'

  # cAdvisor (container metrics)
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 'cadvisor:8080'

  # Application metrics
  # Note: the app and the MySQL, Redis and NGINX exporters below are not
  # defined in the Compose file above; add those services or remove these jobs,
  # otherwise the targets will show as down.
  - job_name: 'my-app'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'app:8080'
        labels:
          app: 'myapp'
          environment: 'production'

  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets:
          - 'mysql-exporter:9104'

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets:
          - 'redis-exporter:9121'

  # NGINX Exporter
  - job_name: 'nginx'
    static_configs:
      - targets:
          - 'nginx-exporter:9113'
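Once Prometheus is running with a configuration like the one above, a quick way to see which targets are reachable is to query the `up` metric through the HTTP API (`/api/v1/query`). The sketch below uses only the standard library. The base URL `http://localhost:9090` matches the Compose port mapping, and the function names are assumptions made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(payload):
    """Return (job, instance) pairs whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0.0:
            labels = series["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


def check_targets(base_url="http://localhost:9090"):
    """Query `up` on a running server (assumes the stack is up)."""
    with urlopen(build_query_url(base_url, "up"), timeout=10) as response:
        return down_targets(json.load(response))
```

With the configuration above and no MySQL exporter running, `check_targets()` would be expected to list the `mysql` job among the down targets.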
Kubernetes with Helm
# Add the repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager + exporters)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin123 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
# Verify
kubectl get pods -n monitoring
# Access Grafana
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
# Open http://localhost:3000 (admin/admin123)
# Access Prometheus
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
4. Metric Types in Prometheus
Prometheus supports four basic metric types. Understanding the differences is essential for writing correct PromQL.
| Type | Description | Examples | Typical PromQL |
|---|---|---|---|
| Counter | A value that only increases (or resets to 0 on restart) | HTTP requests total, bytes sent | rate() |
| Gauge | A value that can go up and down | Temperature, memory usage, queue size | avg(), max() |
| Histogram | Counts observations into buckets (distribution of values) | Request duration, response size | histogram_quantile() |
| Summary | Like a histogram, but quantiles are computed on the client | Request duration quantiles | Quantiles are queried directly |
# Counter: always use rate() or increase()
# Per-second HTTP request rate (averaged over the last 5 minutes)
rate(http_requests_total[5m])
# Total requests in the last hour
increase(http_requests_total[1h])

# Gauge: query directly
# Percentage of memory currently available
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
# 1-minute load average
node_load1

# Histogram: use histogram_quantile()
# 95th percentile request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 99th percentile
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Average request duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
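The reason raw counters should not be graphed directly is that they reset to 0 whenever the process restarts. The following sketch shows the reset handling idea behind `rate()` and `increase()` on a list of `(timestamp_seconds, value)` samples. It is a simplification: the real functions also extrapolate to the edges of the query window, which is not modeled here, and the function names are made up for this example.

```python
def simple_increase(samples):
    """Sum counter increases over (timestamp, value) samples, handling resets."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # A drop means the counter restarted from 0, so the whole
            # current value counts as new increase.
            total += current
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return simple_increase(samples) / elapsed
```

For samples `[(0, 100), (60, 160), (120, 10), (180, 70)]` the counter restarted between the second and third sample. The increase is 60 + 10 + 60 = 130 over 180 seconds, whereas subtracting the first raw value from the last would give a negative number.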
5. Custom Metrics from Your Application
To monitor your own application, add a Prometheus client library to the code and expose a /metrics endpoint.
Python (Flask + prometheus_client)
# app.py - Flask with Prometheus metrics
from flask import Flask, request
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
import time

app = Flask(__name__)

# === METRIC DEFINITIONS ===
# Counter: total HTTP requests
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)
# Histogram: request duration
http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
# Gauge: active connections
active_connections = Gauge(
    'app_active_connections',
    'Number of active connections'
)
# Gauge: queue size
queue_size = Gauge(
    'app_queue_size',
    'Current queue size',
    ['queue_name']
)
# Info: app version
app_info = Info(
    'app',
    'Application information'
)
app_info.info({
    'version': '1.0.0',
    'language': 'python',
    'framework': 'flask'
})

# === MIDDLEWARE ===
@app.before_request
def before_request():
    request._start_time = time.time()
    active_connections.inc()

@app.after_request
def after_request(response):
    # Record request duration
    duration = time.time() - request._start_time
    http_request_duration.labels(
        method=request.method,
        endpoint=request.path
    ).observe(duration)
    # Increment request counter
    http_requests_total.labels(
        method=request.method,
        endpoint=request.path,
        status_code=response.status_code
    ).inc()
    active_connections.dec()
    return response

# === METRICS ENDPOINT ===
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# === APPLICATION ROUTES ===
@app.route('/')
def index():
    return {'status': 'ok', 'version': '1.0.0'}

@app.route('/api/users')
def get_users():
    time.sleep(0.1)  # Simulate work
    return {'users': ['user1', 'user2', 'user3']}

@app.route('/api/orders')
def get_orders():
    queue_size.labels(queue_name='orders').set(42)
    return {'orders': []}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
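One caveat with the Flask example: it uses `request.path` as the `endpoint` label. That is fine for the three fixed routes shown, but a route with an embedded identifier (for example a hypothetical `/api/users/123`) would create a new time series per ID. A possible mitigation, sketched below with the standard library only, is to collapse ID-like path segments before using the path as a label. The `:id` placeholder and the patterns are assumptions; in Flask, using the matched route template from `request.url_rule` may be a cleaner option.

```python
import re

# Segments that look like identifiers (assumed patterns for this sketch).
_NUMERIC_ID = re.compile(r"^\d+$")
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def normalize_path(path):
    """Replace ID-like path segments with a fixed placeholder."""
    segments = []
    for segment in path.split("/"):
        if _NUMERIC_ID.match(segment) or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)
```

With this helper, `endpoint=normalize_path(request.path)` keeps the label set bounded by the number of routes rather than the number of users or orders.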
Node.js (Express + prom-client)
// server.js - Express with Prometheus metrics
const express = require('express');
const client = require('prom-client');

const app = express();
const PORT = 8080;

// === METRIC DEFINITIONS ===
// Label names match the Flask example so the same PromQL queries work for both apps
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status_code']
});
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'endpoint'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
});
const activeConnections = new client.Gauge({
  name: 'app_active_connections',
  help: 'Number of active connections'
});
const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
});

// Collect default metrics (CPU, memory, event loop lag).
// No prefix is set: the default metric names already start with process_ or nodejs_.
client.collectDefaultMetrics();

// === MIDDLEWARE ===
app.use((req, res, next) => {
  const start = Date.now();
  activeConnections.inc();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestsTotal.inc({
      method: req.method,
      endpoint: req.path,
      status_code: res.statusCode
    });
    httpRequestDuration.observe(
      { method: req.method, endpoint: req.path },
      duration
    );
    activeConnections.dec();
  });
  next();
});

// === METRICS ENDPOINT ===
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

// === APPLICATION ROUTES ===
app.get('/', (req, res) => {
  res.json({ status: 'ok', version: '1.0.0' });
});

app.get('/api/users', (req, res) => {
  // Simulate DB query timing
  const end = dbQueryDuration.startTimer({ query_type: 'SELECT', table: 'users' });
  // ... query database ...
  end();
  res.json({ users: ['user1', 'user2'] });
});

app.listen(PORT, () => {
  console.log(`App running on port ${PORT}`);
});
6. Prometheus Exporters
Exporters convert metrics from systems that do not natively support Prometheus into a format that Prometheus can scrape.
Popular Exporters
| Exporter | Purpose | Default Port | Example Metric |
|---|---|---|---|
| node_exporter | System metrics (CPU, RAM, disk, network) | 9100 | node_cpu_seconds_total |
| mysqld_exporter | MySQL metrics | 9104 | mysql_global_status_queries |
| redis_exporter | Redis metrics | 9121 | redis_connected_clients |
| nginx_exporter | NGINX metrics | 9113 | nginx_http_requests_total |
| blackbox_exporter | Probe endpoints (HTTP, TCP, ICMP) | 9115 | probe_http_status_code |
| cadvisor | Container metrics (Docker) | 8080 | container_cpu_usage_seconds_total |
| postgres_exporter | PostgreSQL metrics | 9187 | pg_stat_activity_count |
| mongodb_exporter | MongoDB metrics | 9216 | mongodb_connections |
Installing Node Exporter (Docker)
# Run node_exporter
docker run -d \
  --name node-exporter \
  --net=host \
  --pid=host \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:v1.8.1 \
  --path.rootfs=/host
# Check the metrics
curl http://localhost:9100/metrics | head -50
# Example output:
# node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
# node_memory_MemTotal_bytes 8589934592
# node_filesystem_size_bytes{mountpoint="/"} 107374182400
Blackbox Exporter (HTTP Probing)
# blackbox.yml - blackbox exporter configuration
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
      valid_status_codes: [200, 201]
  tcp_connect:
    prober: tcp
    timeout: 5s
  icmp:
    prober: icmp
    timeout: 5s
---
# prometheus.yml - scrape config for the blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com
          - https://api.example.com
          - https://blog.example.com
    relabel_configs:
      # 1. Pass the listed URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # 2. Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # 3. Send the actual scrape to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115
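The three relabel rules are easier to follow when traced step by step. The sketch below simulates only the default `replace` action on a plain dictionary of labels, with the documented defaults (`regex: (.*)`, `replacement: $1`, `separator: ;`). It is a teaching aid under those assumptions, not a reimplementation of Prometheus relabeling: other actions and capture groups beyond `$1` are ignored.

```python
import re

# The three rules from the scrape config above, as plain dictionaries.
BLACKBOX_RULES = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "blackbox-exporter:9115"},
]


def apply_relabel(labels, rule):
    """Apply one `replace`-action relabel rule to a label dict (simplified)."""
    separator = rule.get("separator", ";")
    source_value = separator.join(
        labels.get(name, "") for name in rule.get("source_labels", [])
    )
    match = re.fullmatch(rule.get("regex", "(.*)"), source_value)
    if match is None:
        return dict(labels)  # regex did not match: labels stay unchanged
    first_group = match.group(1) if match.groups() else ""
    # Only the $1 reference is supported in this sketch.
    value = rule.get("replacement", "$1").replace("$1", first_group)
    updated = dict(labels)
    updated[rule["target_label"]] = value
    return updated


def relabel(labels, rules):
    """Apply the rules in order, as Prometheus does."""
    for rule in rules:
        labels = apply_relabel(labels, rule)
    return labels
```

Starting from `{"__address__": "https://app.example.com"}`, the result has `__param_target` and `instance` set to the probed URL and `__address__` set to `blackbox-exporter:9115`, so Prometheus scrapes the exporter while the series stay labeled with the site being probed.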
7. PromQL Advanced
PromQL (Prometheus Query Language) is the query language used to extract and analyze time-series data.
Rate & Increase
# rate(): per-second average rate of a counter (always wrap counters in rate() or increase())
rate(http_requests_total[5m])
# Per-second rate of responses with status code 500
rate(http_requests_total{status_code="500"}[5m])
# increase(): total increase over the window
increase(http_requests_total[1h])
# irate(): instant rate (uses the last 2 data points in the window)
irate(http_requests_total[5m])
# rate vs irate:
# rate()  -> smooth, good for alerting
# irate() -> responsive, good for dashboards
Aggregation
# sum: total requests across all instances
sum(rate(http_requests_total[5m]))
# sum by label: group by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
# avg: average idle CPU fraction across all cores and nodes
avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
# max: highest memory usage
max(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
# min: lowest available disk space
min(node_filesystem_avail_bytes{mountpoint="/"})
# topk: top 5 busiest endpoints
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
# bottomk: the 3 instances with the lowest average request duration
bottomk(3,
  sum by (instance) (rate(http_request_duration_seconds_sum[5m]))
  /
  sum by (instance) (rate(http_request_duration_seconds_count[5m]))
)
# count: number of instances that are up
count(up == 1)
# stddev: standard deviation of the average request duration across series
stddev(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
Advanced Queries
# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# Request duration percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m])) # p50
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) # p95
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # p99
# Saturation: CPU usage percentage per node
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network throughput (MiB/s)
rate(node_network_receive_bytes_total[5m]) / 1024 / 1024
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
# Apdex Score (Application Performance Index)
# Satisfied: <= 0.1s, tolerating: <= 0.5s. Buckets are cumulative, so the
# le="0.5" bucket already contains the satisfied requests.
(
  sum(rate(http_request_duration_seconds_bucket{le="0.1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
# Predict whether the disk will be full within 4 hours (linear trend of the last hour)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
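The percentile queries above rely on `histogram_quantile()`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea for classic buckets given as `(upper_bound, cumulative_count)` pairs. It is a simplified model (the lowest bucket is assumed to start at 0, and the function name is an assumption), intended to show why the result is only as precise as the bucket boundaries.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    top_bound, total = buckets[-1]
    if not math.isinf(top_bound) or total == 0:
        raise ValueError("need a non-empty +Inf bucket")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls beyond the last finite bucket: the best
                # available answer is the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

For buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (inf, 100)]`, the 95th percentile has rank 95, which falls halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 seconds regardless of where the real observations sit inside that bucket.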
8. Alerting Rules & Alertmanager
Alerting Rules
# prometheus/rules/alerts.yml
groups:
  - name: application_alerts
    rules:
      # High Error Rate (aggregated per instance so that $labels.instance is set)
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High 5xx error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.instance }}"
          description: "p95 latency is {{ $value }}s (threshold: 2s)"

      # Service Down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is DOWN"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"

  - name: infrastructure_alerts
    rules:
      # High CPU Usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"

      # Low Memory
      - alert: LowMemory
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%"

      # Disk Almost Full
      - alert: DiskAlmostFull
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full on {{ $labels.instance }}"
          description: "Disk usage is {{ $value | printf \"%.1f\" }}%"

      # Disk will be full in 4 hours
      - alert: DiskWillFillSoon
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"

      # Container Restart Loop (requires kube-state-metrics, i.e. the Kubernetes setup)
      - alert: ContainerRestartLoop
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} restarting frequently"
          description: "Pod {{ $labels.pod }} restarted {{ $value }} times in the last hour"
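Each rule above has a `for:` duration. An alert whose expression becomes true first enters the pending state and only starts firing once the expression has stayed true for the whole duration; if it turns false in between, the timer starts over. The sketch below models that behavior over a sequence of rule evaluations. It is a simplified illustration, and the function name and the use of evaluation counts instead of wall-clock time are assumptions.

```python
def alert_states(condition_by_eval, for_evals):
    """Map per-evaluation condition results to inactive/pending/firing.

    for_evals is the `for:` duration expressed in evaluation intervals,
    e.g. for: 1m with evaluation_interval: 15s gives for_evals = 4.
    """
    states = []
    active_since = None  # evaluation index at which the condition became true
    for index, condition_true in enumerate(condition_by_eval):
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = index
        held_for = index - active_since
        states.append("firing" if held_for >= for_evals else "pending")
    return states
```

For `ServiceDown` (`for: 1m`, evaluated every 15 seconds), a target that is down for five consecutive evaluations yields four pending states followed by firing. A single successful scrape in between resets the alert to inactive, which is how `for:` filters out short blips.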
Alertmanager Configuration
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-password'

# Routing rules
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts -> PagerDuty + Slack
    - match:
        severity: critical
      receiver: 'critical-pagerduty'
      group_wait: 10s
      repeat_interval: 1h
    # Warning alerts -> Slack only
    - match:
        severity: warning
      receiver: 'warning-slack'
      repeat_interval: 4h
    # Team-specific routing
    # Note: routes are evaluated in order and the first match wins, so an alert
    # that also matches a severity route above only reaches this route if that
    # route sets "continue: true".
    - match:
        team: backend
      receiver: 'backend-slack'

# Receivers
receivers:
  - name: 'default-slack'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'critical-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: 'critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
  - name: 'warning-slack'
    slack_configs:
      - channel: '#alerts-warning'
        color: 'warning'
  - name: 'backend-slack'
    slack_configs:
      - channel: '#backend-alerts'

# Inhibition rules: suppress a warning while the matching critical alert is firing
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
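The order of the entries under `routes` matters: Alertmanager walks them top to bottom and, unless a route sets `continue: true`, stops at the first one that matches. The sketch below models just that first-match behavior for the three routes in the configuration above. It deliberately ignores nested routes, regex matchers and `continue`, and the function name is an assumption.

```python
# The child routes from the configuration above, in file order.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "warning-slack"},
    {"match": {"team": "backend"}, "receiver": "backend-slack"},
]


def pick_receiver(alert_labels, routes=ROUTES, default_receiver="default-slack"):
    """Return the receiver of the first route whose labels all match."""
    for route in routes:
        if all(alert_labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default_receiver  # no child route matched: use the root receiver
```

An alert labeled `severity: critical` and `team: backend` (as `HighErrorRate` is) goes to `critical-pagerduty` and never reaches `backend-slack`. If the backend team should also be notified, the team route would need to come first or the severity routes would need `continue: true`.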
9. Building a Grafana Dashboard
Provisioning Datasource
# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Fixed uid so that provisioned alert rules can reference this datasource
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus
Dashboard JSON (Node Overview)
{
  "dashboard": {
    "title": "Node Overview",
    "tags": ["infrastructure", "node"],
    "timezone": "browser",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 70, "color": "yellow"},
                {"value": 85, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "{{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": null, "color": "green"},
                {"value": 80, "color": "yellow"},
                {"value": 90, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Disk Usage",
        "type": "bargauge",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "(1 - node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Network I/O",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{instance}}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {"unit": "Bps"}
        }
      }
    ],
    "refresh": "30s",
    "time": {"from": "now-1h", "to": "now"}
  }
}
10. Grafana Advanced Features
Variables (Template)
Variables make a dashboard dynamic and can be used as filters:
# Variable: $instance
# Query: label_values(up, instance)
# Result: web1:9100, web2:9100, db1:9100
# Variable: $job
# Query: label_values(up, job)
# Result: node-exporter, prometheus, mysql
# Variable: $interval
# Type: Interval
# Values: 1m, 5m, 15m, 1h, 6h, 1d
# Using variables in queries:
rate(http_requests_total{instance="$instance"}[$interval])
avg by (job) (rate(http_requests_total[$interval]))
node_memory_MemAvailable_bytes{instance=~"$instance"}
Annotations
# Grafana annotations from Prometheus alerts
# Shown on the dashboard when an alert is triggered
# Custom annotations from a data source
# Query: ALERTS{alertstate="firing"}
# Manual annotations via the API
curl -X POST http://admin:admin123@localhost:3000/api/annotations \
  -H "Content-Type: application/json" \
  -d '{
    "dashboardUID": "abc123",
    "time": 1625097600000,
    "text": "Deployed v2.0.0",
    "tags": ["deploy", "v2.0.0"]
  }'
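The same annotation call can be made from a deploy script. The sketch below builds the request with the Python standard library, mirroring the endpoint and payload of the curl command above and sending the credentials as a Basic auth header instead of embedding them in the URL. The function name is an assumption, and the request is only constructed here; sending it requires a running Grafana.

```python
import base64
import json
from urllib.request import Request


def build_annotation_request(base_url, user, password, dashboard_uid, time_ms, text, tags):
    """Build (but do not send) a POST request to Grafana's annotations API."""
    payload = {
        "dashboardUID": dashboard_uid,
        "time": time_ms,  # epoch time in milliseconds
        "text": text,
        "tags": list(tags),
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{base_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. In a deployment pipeline, `time_ms` would typically be `int(time.time() * 1000)` at the moment of the release.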
Popular Panel Types
| Panel Type | Best For | Example Use |
|---|---|---|
| Time Series | Data that changes over time | CPU, memory, request rate |
| Gauge | Current value against a threshold | Disk usage, SLA compliance |
| Stat | Single number | Total requests, uptime |
| Bar Gauge | Comparing items | CPU per instance |
| Table | Tabular data | Top 10 slowest endpoints |
| Heatmap | Density distribution | Request latency distribution |
| Pie Chart | Percentage distribution | Traffic by status code |
| Logs | Log exploration | Application logs from Loki |
Grafana Alerting (Unified Alerting)
# Grafana provisioning alert rules
# grafana/provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: Application Alerts
    folder: Monitoring
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: 'sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100'
              instant: true
          - refId: C
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate detected"
11. Best Practices
Four Golden Signals (Google SRE)
| Signal | Description | Prometheus Metric |
|---|---|---|
| Latency | Time taken to serve a request | histogram_quantile(0.95, ...) |
| Traffic | How many requests are being served | rate(http_requests_total[5m]) |
| Errors | Rate of failed requests | rate(http_requests_total{status_code=~"5.."}[5m]) |
| Saturation | How full a resource is | CPU%, memory%, disk% |
Monitoring Checklist
- Monitor all 4 golden signals: latency, traffic, errors, saturation
- Use rate() for counters; do not query raw counters
- Set retention to match your needs (default 15 days; production: 30 days)
- Use recording rules for frequently used queries
- Tiered alerting: warning (Slack) → critical (PagerDuty + Slack)
- Test alert rules with promtool
- Use dashboard variables for filtering
- Store dashboards as code (JSON in Git)
Avoid high-cardinality labels (such as user_id or request_id): every distinct combination of label values creates a separate time series, which can cause Prometheus to run out of memory. Use labels with a bounded set of values (endpoint, method, status_code).
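A rough way to see why this matters is to multiply the number of possible values per label: the product is the worst-case number of time series for a single metric. The sketch below does exactly that. The cardinalities are made-up numbers for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for http_requests_total.
bounded = {"method": 5, "endpoint": 20, "status_code": 10}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 1000
print(max_series(with_user_id))  # 100000000
```

Adding a single `user_id` label turns a metric with at most a thousand series into one with up to a hundred million, and a histogram multiplies that again by its number of buckets.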
12. Comprehension Quiz
1. Which model does Prometheus use to collect metrics?
2. Which PromQL function should be used with Counter metrics?
3. What does Alertmanager do?
4. What are the "Four Golden Signals" of monitoring?
5. Why are high-cardinality labels dangerous for Prometheus?