1. Unified Observability
The Grafana Stack provides unified observability across three pillars: metrics (Mimir), traces (Tempo), and logs (Loki). All three are integrated in Grafana for correlated investigation: you can drill down from a metric anomaly to a trace to a log in just a few clicks.
- Mimir: Prometheus compatible, long-term retention
- Tempo: trace-to-logs, TraceQL
- Loki: label-based indexing, LogQL
- Grafana: unified dashboards, alerting
| Component | Signal | Query Language | Collection Agent |
|---|---|---|---|
| Mimir | Metrics (numeric time-series) | PromQL | Prometheus, Alloy, OTel |
| Tempo | Traces (request flow) | TraceQL | OTel Collector, Grafana Agent |
| Loki | Logs (text events) | LogQL | Promtail, Alloy, Fluentd |
2. Mimir — Metrics
Grafana Mimir is a Prometheus-compatible TSDB (time-series database) that supports long-term storage, multi-tenancy, and horizontal scaling.
# Add the Grafana Helm repo
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Mimir for development (single tenant, bundled MinIO).
# Note: the mimir-distributed chart deploys microservices mode, not monolithic mode.
helm install mimir grafana/mimir-distributed \
  --namespace monitoring --create-namespace \
  --set mimir.structuredConfig.multitenancy_enabled=false \
  --set minio.enabled=true
# Production mode: microservices
# values-mimir-production.yaml
mimir:
  structuredConfig:
    multitenancy_enabled: true
    alertmanager_storage:
      s3:
        bucket_name: mimir-alertmanager
    blocks_storage:
      s3:
        bucket_name: mimir-blocks
    compactor:
      sharding_ring:
        kvstore:
          store: memberlist
    distributor:
      ring:
        kvstore:
          store: memberlist
    ingester:
      ring:
        kvstore:
          store: memberlist
        replication_factor: 3
# Install with the production values
helm install mimir grafana/mimir-distributed \
  --namespace monitoring \
  --values values-mimir-production.yaml
# Configure Prometheus to remote-write to Mimir
# prometheus.yml
remote_write:
  - url: http://mimir-nginx.monitoring.svc:80/api/v1/push
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
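To make the two queue_config settings concrete, here is a minimal Python sketch of size-or-deadline batching: a batch is sent when it reaches max_samples_per_send, or when its oldest sample has waited batch_send_deadline. This is a simplified model, not Prometheus's actual queue manager. The function batch_samples and the sample data are hypothetical, and the deadline is only checked when the next sample arrives.

```python
def batch_samples(samples, max_samples_per_send, batch_send_deadline):
    """Group samples into batches by size or by age.

    samples: iterable of (arrival_time_seconds, value), sorted by arrival time.
    Returns a list of batches, each a list of values.
    """
    batches = []
    current = []
    started = None  # arrival time of the oldest sample in the current batch
    for arrival, value in samples:
        # Deadline flush: the oldest buffered sample has waited long enough.
        if current and arrival - started >= batch_send_deadline:
            batches.append(current)
            current = []
        if not current:
            started = arrival
        current.append(value)
        # Size flush: the batch is full.
        if len(current) >= max_samples_per_send:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush whatever is left at shutdown
    return batches


samples = [(0.0, "s1"), (0.1, "s2"), (0.2, "s3"), (7.0, "s4")]
print(batch_samples(samples, max_samples_per_send=2, batch_send_deadline=5.0))
# [['s1', 's2'], ['s3'], ['s4']]
```

With the values from the config above (5000 samples, 5s), a busy Prometheus mostly sends full batches, while a quiet one sends smaller batches every few seconds.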
2.1 PromQL Queries in Grafana
# Request rate per second
sum(rate(http_requests_total[5m])) by (service)
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# P99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
)
# Memory usage per pod (% of limit)
container_memory_working_set_bytes{container!=""} /
container_spec_memory_limit_bytes{container!=""} * 100
# CPU usage per node (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk IOPS
sum(rate(node_disk_reads_completed_total[5m])) by (instance) +
sum(rate(node_disk_writes_completed_total[5m])) by (instance)
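The P99 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions: positive bucket bounds, a final +Inf bucket, and 0 < q <= 1. The function reuses the PromQL name only for readability, and the bucket data is made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound;
    the last upper bound must be math.inf (the "+Inf" bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower = 0.0        # lower bound of the current bucket
    prev_count = 0.0   # cumulative count below the current bucket
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return lower + (upper - lower) * fraction
        lower, prev_count = upper, count
    return math.nan


# Hypothetical request-duration buckets: (le in seconds, cumulative count)
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.75, buckets))  # ≈ 0.35
```

The estimate is only as precise as the bucket layout, which is why latency buckets are usually chosen around the thresholds you alert on.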
3. Tempo — Traces
Grafana Tempo is a cost-efficient distributed tracing backend: it requires only object storage (S3, GCS, Azure Blob) and no expensive indexing.
# Install Tempo via Helm
helm install tempo grafana/tempo-distributed \
  --namespace monitoring \
  --set storage.trace.backend=s3 \
  --set storage.trace.s3.bucket=tempo-traces \
  --set storage.trace.s3.endpoint=s3.ap-southeast-1.amazonaws.com
# OTel Collector configuration for sending traces to Tempo and logs to Loki
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.monitoring:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki-gateway.monitoring/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
# TraceQL queries in Grafana Explore
# Find traces by service
{resource.service.name="user-service"}
# Find traces with errors
{resource.service.name="user-service" && status=error}
# Find traces with high latency
{duration > 1s}
# Find spans with a specific attribute
{resource.service.name="api-gateway" && span.http.method="POST"}
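As a rough mental model, a TraceQL selector in braces is a predicate evaluated against each span, and a trace is returned when at least one of its spans matches. The Python sketch below mimics two of the queries above on in-memory data. The Span class, the matching_traces helper, and the sample spans are all hypothetical and do not reflect how Tempo stores or scans traces.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_s: float
    status: str = "ok"


def matching_traces(spans, predicate):
    """Return the sorted IDs of traces that contain at least one matching span."""
    return sorted({span.trace_id for span in spans if predicate(span)})


spans = [
    Span("t1", "user-service", 0.12),
    Span("t1", "api-gateway", 0.30),
    Span("t2", "user-service", 1.80, status="error"),
    Span("t3", "api-gateway", 2.50),
]

# Analogue of {resource.service.name="user-service" && status=error}
errors = matching_traces(
    spans, lambda s: s.service == "user-service" and s.status == "error"
)
# Analogue of {duration > 1s}
slow = matching_traces(spans, lambda s: s.duration_s > 1.0)

print(errors)  # ['t2']
print(slow)    # ['t2', 't3']
```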
4. Loki — Logs
Grafana Loki is a cost-efficient log aggregation system: it indexes only labels, not log content. This makes Loki much cheaper to run than Elasticsearch.
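The idea of indexing only labels can be illustrated with a toy store: log lines are grouped into streams keyed by their label set, a query first selects streams by label, and only then scans the selected lines for text. This is a sketch of the concept, not of Loki's storage engine. TinyLogStore and its methods are hypothetical names.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store that indexes label sets only, never line content."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        wanted = set(selector.items())
        hits = []
        for label_set, lines in self.streams.items():
            # Step 1: cheap label match against the index.
            if wanted <= label_set:
                # Step 2: brute-force scan of the selected stream's content.
                hits.extend(line for line in lines if contains in line)
        return hits


store = TinyLogStore()
store.push({"app": "user-service", "namespace": "production"}, "level=error msg=timeout")
store.push({"app": "user-service", "namespace": "production"}, "level=info msg=ok")
store.push({"app": "api-gateway", "namespace": "production"}, "level=error msg=bad gateway")

print(store.query({"app": "user-service"}, contains="error"))
# ['level=error msg=timeout']
```

The index stays small because it grows with the number of distinct label sets rather than with log volume. The trade-off is that text filters scan every selected line, so narrow label selectors matter.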
# Install Loki via Helm
helm install loki grafana/loki-distributed \
  --namespace monitoring \
  --values values-loki.yaml
# values-loki.yaml
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: s3
    s3:
      endpoint: s3.ap-southeast-1.amazonaws.com
      region: ap-southeast-1
      bucketnames: loki-logs
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
# Promtail: the agent that collects logs
# promtail-config.yaml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki-gateway.monitoring/loki/api/v1/push
scrape_configs:
  # Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: message
      - labels:
          level:
# LogQL queries
# All logs from a specific service
{app="user-service"}
# Logs with level error
{app="user-service"} |= "error" | json | level="error"
# Parse JSON logs and filter
{namespace="production"} | json | status >= 500
# Rate of error logs per second
sum(rate({app="user-service"} |= "error" [5m])) by (pod)
# Extract and filter fields
{app="api-gateway"} | json
  | method = "POST"
  | duration > 1000
  | line_format "{{.method}} {{.path}} {{.duration}}ms"
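The last query is a pipeline: each stage receives the lines that survived the previous one. As an analogy only, the Python sketch below applies the same four stages (parse JSON, filter on method, filter on duration, reformat) to a list of strings. The function run_pipeline and the sample lines are hypothetical, and Loki's real handling of unparsable lines is more nuanced than simply skipping them.

```python
import json


def run_pipeline(lines):
    """Mimic: | json | method = "POST" | duration > 1000 | line_format ..."""
    output = []
    for line in lines:
        # Stage 1 (| json): extract fields; skip lines that are not JSON objects.
        try:
            fields = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(fields, dict):
            continue
        # Stage 2 (| method = "POST"): string equality filter.
        if fields.get("method") != "POST":
            continue
        # Stage 3 (| duration > 1000): numeric comparison filter.
        duration = fields.get("duration")
        if not isinstance(duration, (int, float)) or duration <= 1000:
            continue
        # Stage 4 (| line_format): rewrite the line from extracted fields.
        output.append(f"{fields['method']} {fields.get('path', '')} {duration}ms")
    return output


lines = [
    '{"method": "POST", "path": "/orders", "duration": 1500}',
    '{"method": "GET", "path": "/orders", "duration": 2200}',
    '{"method": "POST", "path": "/login", "duration": 300}',
    "not json at all",
]
print(run_pipeline(lines))  # ['POST /orders 1500ms']
```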
5. Grafana Dashboard
# Datasource provisioning
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  # Mimir (Prometheus compatible)
  - name: Mimir
    type: prometheus
    uid: mimir
    url: http://mimir-nginx.monitoring:80/prometheus
    access: proxy
    isDefault: true
    jsonData:
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  # Tempo
  - name: Tempo
    type: tempo
    url: http://tempo-query-frontend.monitoring:3100
    access: proxy
    uid: tempo
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true
      tracesToMetrics:
        datasourceUid: mimir
      nodeGraph:
        enabled: true
  # Loki
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring
    access: proxy
    uid: loki
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"
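The Loki derived field only works if log lines actually contain text that matcherRegex can capture. A quick way to sanity-check the pattern against your own log format is to try it in Python, as in the sketch below. The helper extract_trace_id and the sample log lines are hypothetical. Grafana evaluates the pattern with its own regex engine, so treat this as an approximation for simple patterns like this one.

```python
import re

# Same pattern as matcherRegex above; the YAML double backslash becomes \w here.
TRACE_ID_PATTERN = re.compile(r"traceID=(\w+)")


def extract_trace_id(log_line):
    """Return the first captured trace ID, or None if the line has none."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


print(extract_trace_id('level=error traceID=4bf92f3577b34da6 msg="upstream timeout"'))
# 4bf92f3577b34da6
print(extract_trace_id('level=error trace_id=4bf92f3577b34da6 msg="upstream timeout"'))
# None
```

The second line shows a common pitfall: if the application logs trace_id or traceId instead of traceID, the pattern does not match and no link is rendered.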
5.1 Dashboard JSON Example
{
  "title": "App Overview - Grafana Stack",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_requests_total[5m])) by (service)",
        "legendFormat": "{{service}}"
      }]
    },
    {
      "title": "Error Rate %",
      "type": "gauge",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "yellow"},
              {"value": 5, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
      }]
    },
    {
      "title": "Recent Errors (Logs)",
      "type": "logs",
      "datasource": "Loki",
      "targets": [{
        "expr": "{namespace=\"production\"} |= \"error\" | json | level=\"error\""
      }]
    }
  ]
}
6. Alerting & Alertmanager
# Grafana alert rule provisioning
# provisioning/alerting/rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: App Health
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: mimir
            relativeTimeRange:
              from: 600
              to: 0
            model:
              instant: true
              expr: |
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service) * 100
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [5]
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate above 5% for {{ $labels.service }}"
          description: "Error rate is {{ $value }}%"
      - uid: high-latency
        title: High P99 Latency
        condition: C
        data:
          - refId: A
            datasourceUid: mimir
            relativeTimeRange:
              from: 600
              to: 0
            model:
              instant: true
              expr: |
                histogram_quantile(0.99,
                  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
          # Threshold: P99 above 1s (matches the summary annotation)
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [1]
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 1s for {{ $labels.service }}"
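The for field is a pending period: a rule fires only after its condition has held continuously for that long, and a single non-breaching evaluation resets the clock. The Python sketch below models this with one boolean per evaluation. It is a simplified illustration, not Grafana's scheduler. The function alert_states is a hypothetical name, and for_intervals is the pending period divided by the evaluation interval (5m / 1m = 5 for the first rule).

```python
def alert_states(breaches, for_intervals):
    """Map a sequence of per-evaluation breach flags to alert states.

    breaches: list of bool, one entry per evaluation interval.
    for_intervals: pending period expressed in evaluation intervals.
    """
    states = []
    consecutive = 0  # number of consecutive breaching evaluations so far
    for breached in breaches:
        consecutive = consecutive + 1 if breached else 0
        if consecutive == 0:
            states.append("Normal")
        elif consecutive > for_intervals:
            # The first breach was at least for_intervals evaluations ago.
            states.append("Alerting")
        else:
            states.append("Pending")
    return states


# Six consecutive breaching evaluations at a 1m interval with for: 5m
print(alert_states([True] * 6, for_intervals=5))
# ['Pending', 'Pending', 'Pending', 'Pending', 'Pending', 'Alerting']

# A brief recovery resets the pending period
print(alert_states([True, True, False, True], for_intervals=2))
# ['Pending', 'Pending', 'Normal', 'Pending']
```

A longer pending period reduces noise from short spikes at the cost of slower notification, which is why the latency rule (warning) waits longer than the error-rate rule (critical).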
# Contact points: send alerts to Slack
# provisioning/alerting/contactpoints.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-alerts
    receivers:
      - uid: slack-prod
        type: slack
        settings:
          url: ${SLACK_WEBHOOK_URL}
          title: "{{ .CommonLabels.alertname }}"
          text: |
            {{ .CommonAnnotations.summary }}
            Severity: {{ .CommonLabels.severity }}
7. Explore: Correlating Signals
The main strength of the Grafana Stack is its ability to correlate all three signals: from a metric you can jump straight to traces, from a trace to its logs, and back again.
7.1 Metric → Trace → Logs Workflow
- Metrics (Explore → Mimir): You spot an error-rate spike on a dashboard
- Exemplars: Click an exemplar point on the metric graph to jump straight to its trace
- Trace (Explore → Tempo): Inspect the trace waterfall to see which span is slow or failing
- Trace → Logs: Click a span to see the related logs from Loki
- Logs → Trace: From a log entry, click the TraceID link to return to the trace
Exemplars link a metric data point to a specific trace ID. Configure them for the Prometheus/Mimir datasource with exemplarTraceIdDestinations (under jsonData in the datasource settings). You can then click a point on the metric graph and go straight to a trace recorded at the time of the anomaly.
8. Production Best Practices
| Practice | Description | Reason |
|---|---|---|
| Retention policy | Metrics 90d, traces 30d, logs 14d | Cost optimization |
| Compactor | Enable the compactor in Mimir | Reduces storage usage |
| Multi-tenancy | Separate tenants per team | Isolation, billing, rate limits |
| Object storage | S3/GCS for all backends | Scalable, cheap, durable |
| Cardinality limits | Limit label cardinality | Prevents memory explosion |
| Recording rules | Pre-compute expensive queries | Faster dashboard load times |
Avoid using high-cardinality values (user ID, request ID) as metric labels. Every unique label combination is one time series: 10,000 users × 100 endpoints = 1,000,000 time series, which can exhaust memory in Mimir or Prometheus. Use traces for high-cardinality data.
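The arithmetic above generalizes: the worst-case number of series for one metric is the product of the number of distinct values of each label. A small Python sketch makes it easy to estimate the effect of adding or dropping a label before shipping it. The function series_count and the label counts are hypothetical, and real series counts are often lower because not every combination occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case series count: product of distinct values per label."""
    return prod(label_cardinalities.values())


with_user_id = {"user_id": 10_000, "endpoint": 100}
without_user_id = {"endpoint": 100}

print(series_count(with_user_id))     # 1000000
print(series_count(without_user_id))  # 100
```

Dropping the user_id label reduces the worst case by four orders of magnitude. The per-user detail can still be recovered from trace attributes.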
9. Quiz: Test Your Understanding!
After reading the tutorial above, answer the following 5 questions: