1. Pengenalan ClickHouse
ClickHouse adalah columnar OLAP database management system open-source yang dirancang untuk analytical queries pada volume data yang sangat besar. Dikembangkan oleh Yandex, ClickHouse mampu memproses miliaran baris per detik dan mengembalikan hasil query dalam hitungan milidetik.
Berbeda dengan OLTP database (PostgreSQL, MySQL) yang dioptimalkan untuk operasi row-by-row, ClickHouse menyimpan data secara kolom (columnar) sehingga sangat efisien untuk aggregasi, filtering, dan scan data besar. ClickHouse digunakan oleh perusahaan seperti Uber, Cloudflare, eBay, dan GitLab untuk analytics real-time.
Materialized views
Columnar compression
parallel processing
low-latency response
1.1 Keunggulan ClickHouse
- Kecepatan Luar Biasa: Vectorized query execution dan columnar storage memproses miliaran baris per detik
- Kompresi Data Efisien: Columnar format memungkinkan kompresi 10-100x dibanding row-based storage
- Linear Scalability: Skala horizontal dengan menambah node tanpa perubahan query
- SQL Penuh: Mendukung syntax SQL yang kaya termasuk window functions, CTE, subqueries
- Real-time Ingestion: Data bisa di-query segera setelah di-insert
- Tidak Ada Index Wajib: Bekerja dengan baik bahkan tanpa index spesifik
1.2 Kapan Menggunakan ClickHouse?
| Cocok Untuk | Tidak Cocok Untuk |
|---|---|
| OLAP / analytical queries | OLTP / transaksi kecil-kecil |
| Time-series analytics | Frequent UPDATE/DELETE |
| Log analytics & monitoring | Point lookups (single row by ID) |
| Real-time dashboards | Data yang sering dimodifikasi |
| Data warehousing | Relational data dengan many-to-many joins |
2. Instalasi & Konfigurasi
2.1 Instalasi ClickHouse
# Metode 1: Install dari official repository (Ubuntu/Debian)
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
--recv 8919F6BD2B48D754
echo "deb https://packages.clickhouse.com/deb stable main" | \
sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt update
sudo apt install -y clickhouse-server clickhouse-client
# Start service
sudo systemctl start clickhouse-server
sudo systemctl enable clickhouse-server
# Metode 2: Docker (cepat untuk testing)
docker run -d --name clickhouse-server \
-p 8123:8123 \
-p 9000:9000 \
-v ~/clickhouse-data:/var/lib/clickhouse \
clickhouse/clickhouse-server:latest
# Verifikasi instalasi
clickhouse-client --query "SELECT version()"
# Koneksi dari luar
clickhouse-client --host localhost --port 9000
2.2 Membuat Database & Tabel Pertama
-- Membuat database
CREATE DATABASE IF NOT EXISTS analytics;
-- Tabel untuk data analytics web
CREATE TABLE analytics.page_views (
timestamp DateTime DEFAULT now(),
user_id UInt64,
page_url String,
referrer String,
country LowCardinality(String),
device_type LowCardinality(String),
session_duration UInt32,
load_time_ms UInt32,
is_bounce UInt8
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (country, user_id, timestamp)
TTL timestamp + INTERVAL 1 YEAR
SETTINGS index_granularity = 8192;
-- Insert data sample
INSERT INTO analytics.page_views
(timestamp, user_id, page_url, referrer, country, device_type,
session_duration, load_time_ms, is_bounce)
VALUES
('2026-06-01 10:00:00', 1001, '/home', 'google.com', 'ID', 'mobile', 120, 350, 0),
('2026-06-01 10:05:00', 1002, '/products', 'facebook.com', 'ID', 'desktop', 45, 200, 1),
('2026-06-01 10:10:00', 1001, '/checkout', 'direct', 'MY', 'mobile', 300, 500, 0),
('2026-06-01 11:00:00', 1003, '/home', 'google.com', 'SG', 'tablet', 200, 280, 0),
('2026-06-01 11:30:00', 1004, '/blog', 'twitter.com', 'ID', 'desktop', 600, 150, 0);
-- Query analytics sederhana
SELECT
country,
device_type,
count() AS total_views,
uniq(user_id) AS unique_users,
avg(session_duration) AS avg_duration,
avg(load_time_ms) AS avg_load_time,
sum(is_bounce) / count() AS bounce_rate
FROM analytics.page_views
GROUP BY country, device_type
ORDER BY total_views DESC;
2.3 Data Types Penting
| Tipe Data | Deskripsi | Cocok Untuk |
|---|---|---|
UInt8/16/32/64 | Integer unsigned | ID, counters, flags |
Float32/64 | Floating point | Harga, measurements |
String | Variable-length text | URL, nama, deskripsi |
LowCardinality(String) | Dictionary-encoded string | Kategori, negara, status |
DateTime/DateTime64 | Timestamp | Waktu event |
Date | Tanggal saja | Partitioning, grouping |
Array(T) | Array dari tipe T | Tags, labels |
Map(K, V) | Key-value pairs | Properties, metadata |
JSON / Object('json') | JSON semi-structured | Data fleksibel |
Gunakan LowCardinality(String) untuk kolom dengan nilai unik yang terbatas (kurang dari 10.000). Ini mengkompresi data secara signifikan dan mempercepat aggregasi hingga 10x dibanding String biasa.
3. MergeTree Engine Family
MergeTree adalah engine utama ClickHouse yang menjadi fondasi dari semua varian engine lainnya. MergeTree menyimpan data dalam parts yang terurut dan secara berkala melakukan background merge untuk mengkonsolidasi parts kecil menjadi parts yang lebih besar.
3.1 Variasi MergeTree
-- 1. MergeTree Standar
CREATE TABLE raw_events (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, user_id, event_time);
-- 2. ReplacingMergeTree — deduplication otomatis
-- Baris dengan ORDER BY key yang sama akan di-merge, hanya menyimpan versi terbaru
CREATE TABLE user_profiles (
user_id UInt64,
name String,
email String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
-- Insert versi terbaru — duplikat akan hilang saat merge
INSERT INTO user_profiles VALUES (1, 'Budi Updated', 'budi@new.com', now());
-- 3. SummingMergeTree — auto-sum untuk kolom numerik
-- Sangat cocok untuk tabel fact/summary
CREATE TABLE daily_metrics (
date Date,
metric_name LowCardinality(String),
source LowCardinality(String),
value UInt64,
count UInt64
) ENGINE = SummingMergeTree((value, count))
PARTITION BY toYYYYMM(date)
ORDER BY (date, metric_name, source);
-- 4. AggregatingMergeTree — pre-aggregasi kompleks
CREATE TABLE agg_page_views (
date Date,
country LowCardinality(String),
page_url String,
view_count AggregateFunction(count, UInt64),
unique_users AggregateFunction(uniq, UInt64),
avg_duration AggregateFunction(avg, UInt32)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, country, page_url);
-- 5. CollapsingMergeTree — soft delete dengan sign column
CREATE TABLE inventory (
product_id UInt64,
warehouse_id UInt32,
quantity UInt32,
sign Int8 -- 1 = baris aktif, -1 = baris dihapus
) ENGINE = CollapsingMergeTree(sign)
ORDER BY (warehouse_id, product_id);
3.2 PARTITION BY & ORDER BY Strategy
-- Partition: gunakan granularity yang tepat
-- - Per hari untuk data sangat besar (>100M rows/day)
-- - Per bulan untuk data sedang
-- - JANGAN per jam (terlalu banyak parts)
-- ORDER BY: kolom yang paling sering di-filter dan di-group
-- Urutan penting! Kolom dengan cardinality rendah dulu
CREATE TABLE events_optimized (
event_date Date,
event_time DateTime,
event_type LowCardinality(String), -- rendah
country LowCardinality(String), -- rendah
user_id UInt64, -- tinggi
page_url String -- tinggi
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, country, user_id, event_time);
-- Cek parts untuk tabel
SELECT
partition,
name,
rows,
formatReadableSize(data_compressed_bytes) as compressed,
formatReadableSize(data_uncompressed_bytes) as uncompressed
FROM system.parts
WHERE table = 'events_optimized' AND active = 1
ORDER BY partition;
-- Force merge parts (biasanya otomatis)
OPTIMIZE TABLE events_optimized FINAL;
4. Materialized Views
Materialized Views di ClickHouse bukan sekadar "saved query" seperti di PostgreSQL. Materialized views di ClickHouse bekerja secara incremental — setiap kali data di-insert ke tabel source, transformasi dan aggregasi langsung dijalankan dan hasilnya disimpan ke tabel target. Ini memungkinkan pre-computasi real-time.
4.1 Jenis Materialized Views
-- 1. Materialized View dengan AggregatingMergeTree
-- Target tabel dulu
CREATE TABLE mv_hourly_stats (
hour DateTime,
event_type LowCardinality(String),
country LowCardinality(String),
event_count AggregateFunction(count, UInt64),
unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (hour, event_type, country);
-- Materialized view
CREATE MATERIALIZED VIEW mv_hourly_stats_mv
TO mv_hourly_stats
AS SELECT
toStartOfHour(event_time) AS hour,
event_type,
country,
countState() AS event_count,
uniqState(user_id) AS unique_users
FROM raw_events
GROUP BY hour, event_type, country;
-- Query hasil pre-aggregasi (gunakan fungsi combinators)
SELECT
hour,
event_type,
countMerge(event_count) AS total_events,
uniqMerge(unique_users) AS unique_users
FROM mv_hourly_stats
GROUP BY hour, event_type
ORDER BY hour DESC;
-- 2. Materialized View dengan SummingMergeTree
CREATE TABLE mv_daily_sums (
date Date,
metric LowCardinality(String),
total_value UInt64
) ENGINE = SummingMergeTree(total_value)
PARTITION BY toYYYYMM(date)
ORDER BY (date, metric);
CREATE MATERIALIZED VIEW mv_daily_sums_mv
TO mv_daily_sums
AS SELECT
toDate(event_time) AS date,
event_type AS metric,
count() AS total_value
FROM raw_events
GROUP BY date, metric;
-- 3. Materialized View untuk Transformasi
CREATE TABLE mv_clean_events (
event_time DateTime,
user_id UInt64,
event_category LowCardinality(String),
region LowCardinality(String)
) ENGINE = MergeTree()
ORDER BY (event_category, region, event_time);
CREATE MATERIALIZED VIEW mv_clean_events_mv
TO mv_clean_events
AS SELECT
event_time,
user_id,
-- Transformasi: memetakan event_type ke kategori
multiIf(
event_type IN ('click', 'tap'), 'interaction',
event_type IN ('view', 'impression'), 'engagement',
event_type IN ('purchase', 'checkout'), 'conversion',
'other'
) AS event_category,
country AS region
FROM raw_events;
4.2 Monitoring Materialized Views
-- Cek status materialized views
SELECT
database,
name,
type,
engine
FROM system.tables
WHERE engine = 'MaterializedView';
-- Cek jumlah insert yang diproses
SELECT
table,
event_time,
rows,
formatReadableSize(bytes_on_disk)
FROM system.parts
WHERE table LIKE 'mv_%' AND active = 1;
-- Error dalam MV (jika ada)
SELECT
query_id,
event_time,
type,
query_duration_ms,
read_rows
FROM system.query_log
WHERE query LIKE '%MATERIALIZED%'
ORDER BY event_time DESC
LIMIT 10;
Materialized views di ClickHouse hanya memproses data baru yang di-insert setelah view dibuat. Data yang sudah ada di tabel source tidak otomatis diproses. Jika perlu memproses data historis, jalankan INSERT INTO SELECT secara manual dari tabel source ke tabel target.
5. Projections
Projections adalah fitur ClickHouse yang memungkinkan menyimpan data dalam beberapa urutan berbeda dalam tabel yang sama. Berbeda dengan materialized views yang membuat tabel terpisah, projections berada di dalam tabel dan otomatis digunakan oleh query optimizer.
5.1 Membuat & Menggunakan Projections
-- Tabel dengan projection
CREATE TABLE events (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
country LowCardinality(String),
page_url String,
duration UInt32
) ENGINE = MergeTree()
ORDER BY (event_time, user_id)
PROJECTION proj_by_user (
SELECT * ORDER BY (user_id, event_time)
),
PROJECTION proj_by_country (
SELECT
country,
toStartOfDay(event_time) AS day,
count() AS cnt,
avg(duration) AS avg_dur
GROUP BY country, day
);
-- Insert data
INSERT INTO events SELECT
now() - toIntervalMinute(rand() % 10000),
rand() % 10000,
arrayElement(['click', 'view', 'purchase', 'scroll'], (rand() % 4) + 1),
arrayElement(['ID', 'MY', 'SG', 'TH', 'VN'], (rand() % 5) + 1),
concat('/page/', toString(rand() % 100)),
rand() % 600
FROM numbers(1000000);
-- Query yang akan menggunakan projection otomatis
-- Query optimizer memilih projection yang paling efisien
SELECT * FROM events WHERE user_id = 42 ORDER BY event_time;
-- → Menggunakan proj_by_user (sudah terurut berdasarkan user_id, event_time)
SELECT country, count(), avg(duration) FROM events GROUP BY country;
-- → Menggunakan proj_by_country (sudah pre-aggregasi)
-- Cek projection yang aktif
SELECT
projection_name,
formatReadableSize(bytes_on_disk) AS size,
rows
FROM system.parts
WHERE table = 'events' AND projection_name != ''
ORDER BY projection_name;
-- Force rebuild projection
ALTER TABLE events MATERIALIZE PROJECTION proj_by_country;
Projections cocok ketika Anda punya beberapa pola query berbeda yang membutuhkan ordering berbeda. Namun projections menambah storage dan waktu insert. Untuk sekadar mempercepat filter pada kolom tertentu, gunakan skip index (bloom filter, minmax, set) yang lebih ringan.
5.2 Skip Index (Data Skipping Index)
-- Bloom filter untuk kolom yang sering di-filter
ALTER TABLE events ADD INDEX idx_page_url (page_url)
TYPE bloom_filter GRANULARITY 3;
-- MinMax index untuk range queries
ALTER TABLE events ADD INDEX idx_duration (duration)
TYPE minmax GRANULARITY 3;
-- Set index untuk kolom low cardinality
ALTER TABLE events ADD INDEX idx_country (country)
TYPE set(10) GRANULARITY 3;
-- n-gram bloom filter untuk substring search
ALTER TABLE events ADD INDEX idx_url_ngram (page_url)
TYPE ngrambf_v1(3, 512, 2, 0) GRANULARITY 3;
-- Materialize index yang sudah ditambahkan
ALTER TABLE events MATERIALIZE INDEX idx_page_url;
ALTER TABLE events MATERIALIZE INDEX idx_duration;
-- Cek index
SELECT
name,
type,
expr,
granularity_in_blocks
FROM system.data_skipping_indices
WHERE table = 'events';
6. Distributed Tables & Clustering
ClickHouse bisa di-deploy sebagai cluster untuk mendistribusikan data dan query ke beberapa node. Setiap node menyimpan sebagian data, dan Distributed engine menyediakan unified view untuk query across seluruh cluster.
6.1 Setup Cluster
<clickhouse>
<remote_servers>
<analytics_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-1</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-2</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse-3</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse-4</host>
<port>9000</port>
</replica>
</shard>
</analytics_cluster>
</remote_servers>
</clickhouse>
6.2 Distributed Tables
-- Di setiap node, buat tabel local
CREATE TABLE events_local ON CLUSTER analytics_cluster (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
country LowCardinality(String),
duration UInt32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (country, user_id, event_time);
-- Buat distributed table (bisa di satu node saja)
CREATE TABLE events_distributed ON CLUSTER analytics_cluster AS events_local
ENGINE = Distributed(analytics_cluster, currentDatabase(), events_local, cityHash64(user_id));
-- INSERT ke distributed table — data otomatis di-route ke shard yang benar
INSERT INTO events_distributed VALUES
(now(), 1001, 'click', 'ID', 100),
(now(), 1002, 'view', 'MY', 50);
-- SELECT dari distributed table — query di-parallel-kan ke semua shard
SELECT
country,
count() AS events,
uniq(user_id) AS users,
avg(duration) AS avg_duration
FROM events_distributed
GROUP BY country;
-- Preferring local data (menghindari network transfer)
SELECT * FROM events_local WHERE country = 'ID';
-- Cluster info
SELECT * FROM system.clusters WHERE cluster = 'analytics_cluster';
7. Query Optimization & Fungsi Lanjutan
7.1 Window Functions
-- Running total dan ranking
SELECT
user_id,
event_time,
event_type,
duration,
-- Running sum per user
sum(duration) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_duration,
-- Ranking berdasarkan duration
rank() OVER (
PARTITION BY event_type
ORDER BY duration DESC
) AS duration_rank,
-- Moving average
avg(duration) OVER (
ORDER BY event_time
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
) AS moving_avg_5
FROM events
ORDER BY user_id, event_time;
-- Lag/Lead untuk analisis sekuensial
SELECT
user_id,
event_time,
event_type,
lagInFrame(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_event,
leadInFrame(event_type) OVER (PARTITION BY user_id ORDER BY event_time) AS next_event,
dateDiff('second',
lagInFrame(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
event_time
) AS seconds_since_prev
FROM events;
7.2 Fungsi Agregasi Khas ClickHouse
-- argMax / argMin: mengambil nilai kolom lain saat max/min tercapai
SELECT
user_id,
argMax(event_time, duration) AS longest_session_time,
argMax(page_url, duration) AS longest_session_page,
max(duration) AS max_duration
FROM events
GROUP BY user_id;
-- quantiles: percentiles untuk distribusi
SELECT
event_type,
count() AS total,
quantile(0.5)(duration) AS p50,
quantile(0.95)(duration) AS p95,
quantile(0.99)(duration) AS p99
FROM events
GROUP BY event_type;
-- uniqExact vs uniq: precision vs speed
SELECT
country,
uniq(user_id) AS approx_users, -- HyperLogLog, cepat
uniqExact(user_id) AS exact_users -- Tepat, lebih lambat
FROM events
GROUP BY country;
-- topK: N nilai paling umum
SELECT
topK(5)(page_url) AS top_pages,
topK(3)(country) AS top_countries
FROM events;
-- histogram: distribusi frekuensi
SELECT
event_type,
histogram(20)(duration) AS duration_distribution
FROM events
GROUP BY event_type;
8. Best Practices & Monitoring
8.1 Query Performance Tips
| Tip | Penjelasan |
|---|---|
| Gunakan PARTITION BY | Filter berdasarkan partisi (tanggal) untuk skip data yang tidak relevan |
| ORDER BY yang tepat | Kolom yang sering di-ORDER BY / GROUP BY harus di awal ORDER BY clause |
| LowCardinality | Gunakan untuk kolom dengan <10.000 nilai unik |
| Materialized Views | Pre-aggregasi untuk query yang sering dijalankan |
| Avoid SELECT * | Pilih kolom yang diperlukan — columnar storage akan skip kolom lain |
| Batch INSERT | Insert dalam batch 1000-100.000 baris, bukan satu per satu |
| EXPLAIN PIPELINE | Gunakan untuk memahami eksekusi query |
8.2 Monitoring Cluster
-- Query yang sedang running
SELECT
query_id,
user,
elapsed,
formatReadableSize(memory_usage) AS mem,
formatReadableSize(read_bytes) AS read,
query
FROM system.processes;
-- Top queries by duration
SELECT
query_id,
user,
query_duration_ms,
formatReadableSize(read_bytes) AS read_size,
formatReadableSize(result_bytes) AS result_size,
substring(query, 1, 100) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 10;
-- Storage usage per tabel
SELECT
database,
table,
formatReadableSize(sum(bytes_on_disk)) AS disk_size,
sum(rows) AS total_rows,
count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
-- Replication status (untuk replicated tables)
SELECT
database,
table,
is_leader,
absolute_delay,
queue_size
FROM system.replicas;
9. Quiz: Uji Pemahamanmu!
Setelah membaca tutorial di atas, jawablah 5 pertanyaan berikut untuk menguji pemahamanmu tentang ClickHouse: