1. Introduction to the Transformer
The Transformer is a deep learning architecture introduced by Vaswani et al. at Google in the landmark paper "Attention Is All You Need" (2017). It has largely displaced RNNs and LSTMs for NLP tasks and now underpins most modern AI models, from BERT to ChatGPT and from DALL-E to AlphaFold.
The key innovation of the Transformer is the attention mechanism, which lets the model "look at" all tokens in a sequence at once (in parallel) and decide which ones are most relevant to each position, without having to process tokens one after another as an RNN does.
Why Did the Transformer Replace RNNs?
RNN/LSTM (sequential processing):
  x₁ → h₁ → x₂ → h₂ → x₃ → h₃   (one step at a time)
  Problems:
  • Vanishing gradients
  • Slow (sequential)
  • Long-range dependencies are hard to learn
TRANSFORMER (parallel processing):
  x₁, x₂, x₃ each feed a self-attention unit at the same time,
  and every position attends to every other position in a single step.
TRANSFORMER ADVANTAGES:
✅ Parallel: all positions are processed at once
✅ Long-range dependencies: attention can "see" distant tokens directly
✅ Scalable: can be trained on massive GPU/TPU clusters
✅ Universal: NLP, vision, audio, even proteins
Impact of the Transformer
| Domain | Transformer Models | Achievement |
|---|---|---|
| NLP | BERT, GPT-4, LLaMA, Gemini | Strong text understanding and generation, near human level on many benchmarks |
| Computer Vision | ViT, DeiT, Swin Transformer | Replaced CNNs in many tasks |
| Audio | Whisper, MusicLM | Speech recognition and music generation |
| Generative AI | DALL-E 3, Stable Diffusion, Sora | Image and video generation from text |
| Protein | AlphaFold 2 | Protein structure prediction, recognized by the 2024 Nobel Prize in Chemistry |
| Code | GitHub Copilot, CodeLlama | AI-assisted programming |
| Robotics | RT-2, Gato | Multi-modal robot control |
2. The Concept of Attention
Imagine reading the Indonesian sentence "Kucing itu duduk di atas tikar karena ia merasa lelah." ("The cat sat on the mat because it felt tired."). The word "ia" ("it") refers to "kucing" ("cat"). Attention is the mechanism that lets the model work out that "ia" is strongly related to "kucing", even when the two words are far apart in the sentence.
Attention in Everyday Language
Sentence: "Kucing itu duduk di atas tikar karena ia merasa lelah" ("The cat sat on the mat because it felt tired")
Attention weights from the word "ia" ("it"), illustrative values:
| Word | Kucing | itu | duduk | di | atas | tikar | karena | ia | merasa | lelah |
|---|---|---|---|---|---|---|---|---|---|---|
| Gloss | cat | that | sat | at | top | mat | because | it | felt | tired |
| Weight | 0.65 | 0.02 | 0.05 | 0.01 | 0.02 | 0.03 | 0.04 | 0.02 | 0.10 | 0.06 |
"Kucing" receives by far the highest weight (0.65); the weights sum to 1. In a visualization with one line per word, the line from "ia" to "Kucing" would be the thickest.
→ The model "knows" that "ia" refers to "Kucing".
Evolution of Attention
| Era | Mechanism | Explanation |
|---|---|---|
| 2014 | Seq2Seq + Attention (Bahdanau) | Attention added to an encoder-decoder RNN. The decoder "looks at" the encoder hidden states |
| 2016 | Self-Attention (Lin et al.) | Tokens within a single sequence "look at" one another |
| 2017 | Multi-Head Self-Attention (Transformer) | Multiple attention heads work in parallel, with no RNN |
| 2018+ | Sparse Attention, Linear Attention | Efficiency: O(n²) → O(n log n) or O(n) |
3. Self-Attention Mechanism
Self-attention is a mechanism in which every token in a sequence computes a relevance score against every token in the same sequence, including itself. This differs from ordinary attention (cross-attention), in which one sequence attends to a different sequence.
Query, Key, Value (Q, K, V)
For each token we create three vectors: a Query (Q), a Key (K), and a Value (V).
Step-by-step self-attention:
INPUT: embedding matrix X (n_tokens × d_model)
  x₁ = [0.2, 0.5, -0.1, ...]   ← Token 1
  x₂ = [0.8, -0.3, 0.4, ...]   ← Token 2
  x₃ = [-0.5, 0.1, 0.9, ...]   ← Token 3
  x₄ = [0.3, 0.7, -0.2, ...]   ← Token 4
Step 1: Compute Q, K, V (linear projections)
  Q = X × Wq   (query: "what am I looking for?")
  K = X × Wk   (key: "what do I offer?")
  V = X × Wv   (value: "the actual information")
Step 2: Compute attention scores
  scores = Q × Kᵀ          (dot product of Q and K)
  scores = scores / √d_k   (scaling, so the values do not grow too large)
  Q × Kᵀ =
    [ s₁₁ s₁₂ s₁₃ s₁₄ ]
    [ s₂₁ s₂₂ s₂₃ s₂₄ ]     how relevant each pair
    [ s₃₁ s₃₂ s₃₃ s₃₄ ]     of tokens is
    [ s₄₁ s₄₂ s₄₃ s₄₄ ]
Step 3: Softmax (normalize each row into probabilities)
  attention_weights = softmax(scores)
    [ 0.45 0.20 0.10 0.25 ]   ← Token 1 attends most to itself (0.45), then to Token 4 (0.25)
    [ 0.15 0.50 0.25 0.10 ]
    [ 0.10 0.30 0.35 0.25 ]
    [ 0.20 0.10 0.15 0.55 ]
Step 4: Output = attention_weights × V
  output = softmax(Q × Kᵀ / √d_k) × V
  (the softmax term holds the attention weights, V holds the values)
  → Each output token is a weighted sum of all value vectors
  → Tokens with higher attention weights contribute more
Why Scale by √d_k?
Without scaling, when the dimension d_k is large the dot product Q·K can take very large values. Softmax applied to very large values produces a very "sharp" distribution (one position gets almost all the weight and the rest get ~0). In that saturated regime the gradients become very small and training slows down.
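The effect described above can be checked numerically. The following is a minimal sketch, assuming the components of q and k are independent standard-normal values (an assumption made for illustration, not something trained models guarantee); the helper names `softmax` and `dot_product_stats` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def dot_product_stats(d_k, n_samples=20000):
    """Variance of q.k for random unit-variance q and k, before and after scaling."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = (q * k).sum(axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()


# The raw variance grows roughly like d_k; the scaled variance stays near 1.
for dim in (4, 64, 512):
    raw_var, scaled_var = dot_product_stats(dim)
    print(f"d_k={dim:4d}  var(q.k)={raw_var:8.1f}  var(q.k/sqrt(d_k))={scaled_var:.2f}")

# Sharpness of softmax for one query against 10 keys, with and without scaling.
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))
scores = K @ q
print("max weight, unscaled:", softmax(scores).max())
print("max weight, scaled:  ", softmax(scores / np.sqrt(d_k)).max())
```

With unscaled scores, the largest softmax weight is typically close to 1, which is the saturated case the text warns about.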
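Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V
where d_k is the dimension of the key vectors. If the components of q and k are independent with mean 0 and variance 1, their dot product has variance d_k, so dividing by √d_k brings the variance of the scores back to about 1.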
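The steps and the formula above can be sketched in a few lines of NumPy. This is an illustrative single-head version with random projection matrices; the function name `self_attention` and the sizes used here (4 tokens, d_model = d_k = 8) are assumptions chosen to keep the example small.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention on one sequence."""
    Q = X @ W_q                          # Step 1: linear projections
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Step 2: scaled scores, (n_tokens, n_tokens)
    weights = softmax(scores)            # Step 3: each row sums to 1
    return weights @ V, weights          # Step 4: weighted sum of the values


rng = np.random.default_rng(42)
n_tokens, d_model, d_k = 4, 8, 8
X = rng.standard_normal((n_tokens, d_model))   # stand-in for token embeddings
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.round(2))        # attention matrix, one row per query token
print(weights.sum(axis=-1))    # every row sums to 1
print(output.shape)            # (4, 8)
```

In a trained model the three projection matrices are learned; here they are random, so the weights carry no meaning beyond showing the shapes and the normalization.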
4. Multi-Head Attention
Instead of computing attention once, multi-head attention runs h attention computations in parallel (typically h = 8 or 12), each with its own learned Q, K, V projections. The results are concatenated and projected once more.
Why Multi-Head?
SINGLE HEAD: attention from one perspective
  x → Q, K, V → Attention → out
MULTI-HEAD (h = 4): attention from four different perspectives
  x → Q₁, K₁, V₁ → Attn 1
  x → Q₂, K₂, V₂ → Attn 2
  x → Q₃, K₃, V₃ → Attn 3
  x → Q₄, K₄, V₄ → Attn 4
  (Attn 1, ..., Attn 4) → Concat → Linear → out
Each head can learn to "see" a different pattern, for example:
  Head 1: Syntactic (subject ↔ verb)
  Head 2: Semantic (synonym relations)
  Head 3: Positional (nearby words)
  Head 4: Long-range dependencies
Formula:
MultiHead(Q, K, V) = Concat(head₁, ..., head_h) × W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
d_model = 512, h = 8 → d_k = d_v = 512/8 = 64 per head
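To make the head arithmetic concrete, here is a rough NumPy sketch of the split, attend, concatenate, project sequence for d_model = 512 and h = 8. The helper names `split_heads`, `merge_heads` and `multi_head_attention` are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def split_heads(x, n_heads):
    """(seq_len, d_model) -> (n_heads, seq_len, d_k)."""
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0
    d_k = d_model // n_heads
    return x.reshape(seq_len, n_heads, d_k).transpose(1, 0, 2)


def merge_heads(x):
    """(n_heads, seq_len, d_k) -> (seq_len, n_heads * d_k), i.e. Concat(head_1..head_h)."""
    n_heads, seq_len, d_k = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, n_heads * d_k)


def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    Q = split_heads(X @ W_q, n_heads)
    K = split_heads(X @ W_k, n_heads)
    V = split_heads(X @ W_v, n_heads)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq_len, seq_len)
    weights = softmax(scores)
    heads = weights @ V                                 # (h, seq_len, d_k)
    return merge_heads(heads) @ W_o, weights


rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 512, 8, 5
X = rng.standard_normal((seq_len, d_model))
# Random stand-ins for the learned projections, scaled to keep values moderate.
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(split_heads(X, n_heads).shape)  # (8, 5, 64): 8 heads of dimension 64
print(weights.shape)                  # (8, 5, 5): one attention matrix per head
print(out.shape)                      # (5, 512)
```

Using one d_model × d_model matrix per projection and then reshaping is equivalent to having h separate d_model × d_k matrices, which is why many implementations take this route.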
5. Positional Encoding
Unlike an RNN, which captures order naturally because it processes tokens one at a time, the Transformer processes all tokens in parallel. Without position information, the model cannot tell "The dog bites the cat" from "The cat bites the dog", even though the two mean very different things.
Sinusoidal Positional Encoding
The original Transformer uses sine and cosine functions of different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
import numpy as np
import matplotlib.pyplot as plt


def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, shape (max_len, d_model); assumes an even d_model."""
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len).reshape(-1, 1)
    # 1 / 10000^(2i/d_model) = exp(-2i * log(10000) / d_model)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(positions * div_term)  # Even dimensions
    PE[:, 1::2] = np.cos(positions * div_term)  # Odd dimensions
    return PE


# Generate PE
max_len = 100
d_model = 64
pe = positional_encoding(max_len, d_model)
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Full PE heatmap
axes[0, 0].imshow(pe.T, aspect='auto', cmap='RdBu')
axes[0, 0].set_xlabel('Position')
axes[0, 0].set_ylabel('Dimension')
axes[0, 0].set_title('Positional Encoding Heatmap')

# Plot 2: Individual dimensions
for dim in [0, 1, 2, 3, 10, 20]:
    axes[0, 1].plot(pe[:50, dim], label=f'dim {dim}')
axes[0, 1].set_xlabel('Position')
axes[0, 1].set_ylabel('PE Value')
axes[0, 1].set_title('PE Values across Positions')
axes[0, 1].legend(fontsize=8)
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Similarity matrix (cosine similarity between positions)
from numpy.linalg import norm


def cosine_sim(a, b):
    # The small epsilon guards against division by zero
    return np.dot(a, b) / (norm(a) * norm(b) + 1e-8)


n_pos = 50
sim_matrix = np.zeros((n_pos, n_pos))
for i in range(n_pos):
    for j in range(n_pos):
        sim_matrix[i, j] = cosine_sim(pe[i], pe[j])
axes[1, 0].imshow(sim_matrix, cmap='viridis')
axes[1, 0].set_xlabel('Position j')
axes[1, 0].set_ylabel('Position i')
axes[1, 0].set_title('Cosine Similarity between Positions')
plt.colorbar(axes[1, 0].images[0], ax=axes[1, 0])

# Plot 4: Info panel
axes[1, 1].text(0.5, 0.5,
                f'Positional Encoding\n\n'
                f'Max Length: {max_len}\n'
                f'Model Dimension: {d_model}\n'
                f'Frequency Range: sin & cos\n'
                f'Method: Sinusoidal\n\n'
                f'Properties:\n'
                f'• PE(pos+k) can be written as a\n'
                f'  linear function of PE(pos)\n'
                f'• Every position is unique\n'
                f'• May extrapolate to sequences\n'
                f'  longer than those seen in training',
                ha='center', va='center', fontsize=11,
                transform=axes[1, 1].transAxes,
                bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.3))
axes[1, 1].axis('off')

plt.suptitle('Positional Encoding - Transformer', fontsize=14)
plt.tight_layout()
plt.show()
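The property that PE(pos+k) is a linear function of PE(pos) follows from the angle-addition identities for sine and cosine: each (sin, cos) pair is rotated by an angle that depends only on k. A small check of this, assuming the same sinusoidal definition as above (the helper `shift_matrix` is a hypothetical name introduced for this sketch):

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, shape (max_len, d_model); assumes an even d_model."""
    pe = np.zeros((max_len, d_model))
    positions = np.arange(max_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe[:, 0::2] = np.sin(positions * div_term)
    pe[:, 1::2] = np.cos(positions * div_term)
    return pe


def shift_matrix(k, d_model):
    """Block-diagonal matrix M_k such that PE(pos + k) = M_k @ PE(pos) for every pos."""
    freqs = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin((pos+k)w) =  cos(kw) sin(pos w) + sin(kw) cos(pos w)
        # cos((pos+k)w) = -sin(kw) sin(pos w) + cos(kw) cos(pos w)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M


pe = positional_encoding(100, 64)
k = 7
M = shift_matrix(k, 64)
shifted = pe[:-k] @ M.T               # apply M_k to every row PE(pos)
print(np.allclose(shifted, pe[k:]))   # True
```

Note that M_k does not depend on pos, which is what makes relative offsets easy for the model to express.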
6. The Full Transformer Architecture
Encoder-Decoder Architecture
ENCODER (left):
  Input Embedding + Positional Encoding, then N identical layers, each:
  → Multi-Head Self-Attention → + Residual → Layer Norm
  → Feed-Forward Network (FFN) → + Residual → Layer Norm
DECODER (right):
  Output Embedding + Positional Encoding, then N identical layers, each:
  → Masked Multi-Head Self-Attention → + Residual → Layer Norm
  → Cross-Attention (Q from the decoder; K, V from the encoder output) → + Residual → Layer Norm
  → Feed-Forward Network (FFN) → + Residual → Layer Norm
OUTPUT:
  Decoder output → Linear + Softmax
COMPONENTS:
• Input Embedding: token → vector (d_model = 512)
• Positional Encoding: adds position information
• Multi-Head Attention: several self-attention heads in parallel
• Feed-Forward Network: 2 linear layers with a ReLU in between, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
• Layer Normalization: stabilizes training
• Residual Connection: x + Sublayer(x)
• Masked Attention (decoder): prevents "seeing" future tokens
Key Components
| Component | Function | Dimensions (d_model = 512) |
|---|---|---|
| Input Embedding | Token → dense vector | (batch, seq_len, 512) |
| Positional Encoding | Adds token-order information | (seq_len, 512) |
| Multi-Head Self-Attention | Every token attends to all tokens | 8 heads × 64 dims |
| Layer Normalization | Normalizes activations (stabilizes training) | (batch, seq_len, 512) |
| Feed-Forward Network | Position-wise non-linear transformation | 512 → 2048 → 512 |
| Residual Connection | Skip connection: x + Sublayer(x) | (batch, seq_len, 512) |
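The masked attention used in the decoder can be illustrated with a lower-triangular mask. This is a minimal sketch in which all raw scores are set to zero, so the resulting weights only show which positions are visible; `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def causal_mask(seq_len):
    """Lower-triangular matrix: 1 where attention is allowed, 0 for future positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))


def masked_softmax(scores, mask):
    """Set masked-out scores to -inf so they get exactly zero weight after softmax."""
    masked = np.where(mask == 1, scores, -np.inf)
    return softmax(masked)


seq_len = 4
scores = np.zeros((seq_len, seq_len))   # equal scores: weights are uniform over visible tokens
weights = masked_softmax(scores, causal_mask(seq_len))
print(weights)
# Row 0 can only see token 0, row 1 sees tokens 0-1, and so on:
# [[1.    0.    0.    0.  ]
#  [0.5   0.5   0.    0.  ]
#  [0.333 0.333 0.333 0.  ]
#  [0.25  0.25  0.25  0.25]]   (values rounded)
```

Because the diagonal is always visible, every row keeps at least one finite score, so the softmax never divides by zero.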
7. BERT: Encoder-Only
BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018. BERT uses only the encoder stack of the Transformer and is pre-trained with two tasks, described below.
Pre-training Tasks
1. MASKED LANGUAGE MODELING (MLM): randomly mask 15% of the tokens; the model must predict them.
   Input:  "The [MASK] sat on the [MASK]"
   Target: "The CAT sat on the MAT"
   The model sees context from BOTH directions (bidirectional).
2. NEXT SENTENCE PREDICTION (NSP): the model decides whether two sentences are consecutive.
   Input: [CLS] The cat sat. [SEP] It was fluffy. [SEP]    Label: IsNext (consecutive)
   Input: [CLS] The cat sat. [SEP] Stocks went up. [SEP]   Label: NotNext (not consecutive)
BERT VARIANTS:
| Model | Layers | Hidden | Params |
|---|---|---|---|
| BERT-Base | 12 | 768 | 110M |
| BERT-Large | 24 | 1024 | 340M |
FINE-TUNING BERT:
Pre-trained BERT → add a task-specific head → fine-tune
• [CLS] → BERT → [CLS] embedding → Linear → Sentiment (pos/neg)
• Token → BERT → Token embeddings → NER tags
• Q + A → BERT → Start/End positions → Question Answering
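The masked language modeling task described above can be sketched with plain Python. This is a simplified illustration, not BERT's actual data pipeline: it always substitutes `[MASK]`, whereas the BERT paper replaces a selected token with `[MASK]` only 80% of the time (10% random token, 10% unchanged). The function `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where labels[i] is not None.
    """
    rng = random.Random(seed)
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL_TOKENS]
    # Mask about mask_prob of the candidates, but at least one if any exist.
    n_to_mask = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    chosen = set(rng.sample(candidates, n_to_mask))
    masked = [MASK_TOKEN if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels


tokens = "[CLS] the cat sat on the mat [SEP]".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

With only six maskable tokens, 15% rounds to a single masked position; on realistic sequence lengths the proportion comes out closer to the nominal 15%.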
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
import torch
# =============================================
# 1. LOAD PRE-TRAINED BERT
# =============================================
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print(f"Model: BERT-base-uncased")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Max position: {model.config.max_position_embeddings}")
# =============================================
# 2. TOKENIZATION
# =============================================
text = "Transformer is a revolutionary architecture in NLP."
tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
print(f"\nOriginal text: {text}")
print(f"Token IDs: {tokens['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")
print(f"Attention mask: {tokens['attention_mask']}")
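# BERT tokenizes with WordPiece, which may split a word into subwords, e.g.
# "revolutionary" → ["revolution", "##ary"] (illustrative; the actual split depends on the vocabulary)
# The "##" prefix marks a subword that continues the previous token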
# =============================================
# 3. GET EMBEDDINGS
# =============================================
with torch.no_grad():
    outputs = model(**tokens)
# outputs.last_hidden_state: (batch, seq_len, hidden_size=768)
# outputs.pooler_output: (batch, hidden_size=768), the [CLS] hidden state after a linear layer + tanh
print(f"\nLast hidden state shape: {outputs.last_hidden_state.shape}")
print(f"Pooler output (CLS) shape: {outputs.pooler_output.shape}")
# =============================================
# 4. FINE-TUNING FOR SENTIMENT ANALYSIS
# =============================================
# BertForSequenceClassification and torch are already imported at the top of this script
# Load BERT with a classification head (2 labels: 0=negative, 1=positive)
model_clf = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
# Fine-tuning example
texts = ["This movie is amazing!", "I hated this film."]
labels = torch.tensor([1, 0])  # 1=positive, 0=negative
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
# One forward pass with labels returns the cross-entropy loss;
# a real fine-tuning loop would also call loss.backward() and an optimizer step
outputs = model_clf(**inputs, labels=labels)
print(f"\nFine-tuning Loss: {outputs.loss.item():.4f}")
print(f"Logits shape: {outputs.logits.shape}")
# Prediction (the classification head is randomly initialized,
# so these probabilities are not meaningful until the model is trained)
predictions = torch.softmax(outputs.logits, dim=1)
print(f"Predictions: {predictions}")
# Columns follow the label ids: [negative_prob, positive_prob] for each text
8. GPT: Decoder-Only
GPT (Generative Pre-trained Transformer) from OpenAI uses a decoder-only Transformer architecture. Unlike the bidirectional BERT, GPT is autoregressive: it predicts the next token from the tokens that came before it.
BERT vs. GPT: Key Differences
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Direction | Bidirectional (attends to both left and right context) | Unidirectional (attends only to previous tokens) |
| Pre-training | MLM + NSP | Next-token prediction |
| Masking | Random mask (15% of tokens) | Causal mask (future tokens are hidden) |
| Best for | Understanding (classification, NER, QA) | Generation (text, code, stories) |
| Fine-tuning | Task-specific head | In-context learning (prompting) + RLHF |
| Examples | BERT, RoBERTa, DeBERTa | GPT-4, LLaMA, Mistral, PaLM |
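The autoregressive behaviour in the GPT column can be illustrated with a toy generation loop. Everything below is hypothetical: `TOY_TABLE` and `next_token_probs` stand in for a real model's next-token distribution, and greedy decoding (always picking the most likely token) is only one of several decoding strategies.

```python
# Hypothetical next-token probabilities, keyed by the last token of the prefix.
TOY_TABLE = {
    "<s>": {"the": 0.9, "a": 0.1},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"on": 0.9, "</s>": 0.1},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}


def next_token_probs(prefix):
    """Toy stand-in for a language model: a distribution over the next token.

    A real model conditions on the whole prefix; this toy only does so for "the".
    """
    last = prefix[-1]
    if last == "the":
        if "cat" in prefix:
            return {"mat": 0.7, "cat": 0.3}
        return {"cat": 0.6, "mat": 0.4}
    return TOY_TABLE[last]


def generate(prefix, max_new_tokens=10):
    """Greedy autoregressive decoding: append the most likely token, then repeat."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # depends only on tokens so far
        next_token = max(probs, key=probs.get)
        tokens.append(next_token)
        if next_token == "</s>":                  # end-of-sequence token
            break
    return tokens


print(generate(["<s>"]))
# ['<s>', 'the', 'cat', 'sat', 'on', 'the', 'mat', '</s>']
```

Each step feeds the model only the tokens generated so far, which is the same constraint the causal mask enforces during training.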
Evolution of GPT
Evolution of the GPT models (OpenAI):
GPT-1 (2018):
• 117M parameters, 12 layers
• Pre-training + fine-tuning paradigm
• Showed early signs of zero-shot capability
GPT-2 (2019):
• 1.5B parameters, 48 layers
• Initially withheld as "too dangerous to release", then released in stages
• Multi-task without fine-tuning
GPT-3 (2020):
• 175B parameters, 96 layers
• In-context learning: few-shot, one-shot, zero-shot
• Prompt engineering emerges
• Training cost: ~$4.6 million (third-party estimate)
GPT-3.5 / ChatGPT (2022):
• Fine-tuned with RLHF (Reinforcement Learning from Human Feedback)
• Conversational AI that went viral
• A reported 100 million users within 2 months
GPT-4 (2023):
• Multi-modal (text + image input)
• Large improvement in reasoning
• Rumored (unconfirmed) mixture-of-experts (MoE) architecture
GPT-4o (2024):
• Omni-modal (text, image, audio, video)
• Real-time voice conversation
• Faster and cheaper
Emergent abilities (appear at large scale):
• Chain-of-thought reasoning
• Code generation
• Mathematical reasoning
• In-context learning
• Instruction following
9. Self-Attention Implementation in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class ScaledDotProductAttention(nn.Module):
    """Scaled Dot-Product Attention."""

    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k

    def forward(self, Q, K, V, mask=None):
        # Q, K, V shape: (batch, heads, seq_len, d_k)
        # Step 1: Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores shape: (batch, heads, seq_len, seq_len)
        # Step 2: Apply mask (for the decoder: block attention to future tokens)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Step 3: Softmax over the key dimension
        attn_weights = F.softmax(scores, dim=-1)
        # Step 4: Weighted sum of the values
        output = torch.matmul(attn_weights, V)
        # output shape: (batch, heads, seq_len, d_k)
        return output, attn_weights
class MultiHeadAttention(nn.Module):
    """Multi-Head Attention."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)  # Query projection
        self.W_k = nn.Linear(d_model, d_model)  # Key projection
        self.W_v = nn.Linear(d_model, d_model)  # Value projection
        self.W_o = nn.Linear(d_model, d_model)  # Output projection
        self.attention = ScaledDotProductAttention(self.d_k)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # 1. Linear projection
        Q = self.W_q(Q)  # (batch, seq_len, d_model)
        K = self.W_k(K)
        V = self.W_v(V)
        # 2. Split into multiple heads
        # (batch, seq_len, d_model) → (batch, n_heads, seq_len, d_k)
        Q = Q.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        # 3. Scaled dot-product attention
        output, attn_weights = self.attention(Q, K, V, mask)
        # 4. Concatenate heads
        # (batch, n_heads, seq_len, d_k) → (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        # 5. Final linear projection
        output = self.W_o(output)
        return output, attn_weights
class TransformerBlock(nn.Module):
    """Single Transformer Encoder Block."""

    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-Head Self-Attention
        self.attention = MultiHeadAttention(d_model, n_heads)
        # Feed-Forward Network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        # Layer Normalization
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # 1. Self-Attention + Residual + LayerNorm
        attn_out, attn_weights = self.attention(x, x, x, mask)
        x = self.ln1(x + self.dropout(attn_out))
        # 2. FFN + Residual + LayerNorm
        ffn_out = self.ffn(x)
        x = self.ln2(x + self.dropout(ffn_out))
        return x, attn_weights
# =============================================
# TEST: Run the Transformer block (multi-head self-attention)
# =============================================
print("=" * 60)
print("TESTING MULTI-HEAD SELF-ATTENTION")
print("=" * 60)
# Hyperparameters
d_model = 512
n_heads = 8
d_ff = 2048
seq_len = 10
batch_size = 2
# Create model
block = TransformerBlock(d_model, n_heads, d_ff)
# Random input (simulating token embeddings)
x = torch.randn(batch_size, seq_len, d_model)
# Forward pass
output, attn_weights = block(x)
print(f"Input shape: {x.shape}") # (2, 10, 512)
print(f"Output shape: {output.shape}") # (2, 10, 512)
print(f"Attention weights shape: {attn_weights.shape}") # (2, 8, 10, 10)
# Attention weight inspection: each row is a softmax distribution
print(f"\nAttention weights (Head 0, Sample 0):")
print(f"Shape: {attn_weights[0, 0].shape} (seq_len × seq_len)")
print(f"Row sum: {attn_weights[0, 0].sum(dim=-1).tolist()}")  # ≈ 1.0 per row
# Parameter count
params = sum(p.numel() for p in block.parameters())
print(f"\nParameters in Transformer block: {params:,}")
# Multi-head attention shapes summary
print("\n=== Shape Summary ===")
print(f"d_model = {d_model}")
print(f"n_heads = {n_heads}")
print(f"d_k = d_v = {d_model // n_heads}")
print(f"Q, K, V per head: ({seq_len}, {d_model // n_heads})")
print(f"Attention matrix: ({seq_len}, {seq_len})")
print(f"Output per head: ({seq_len}, {d_model // n_heads})")
print(f"Concat output: ({seq_len}, {d_model})")
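The parameter count printed by the script can be cross-checked by hand. The sketch below assumes the layer layout of the `TransformerBlock` above (four d_model × d_model projections with bias, a two-layer FFN with bias, and two LayerNorms with a scale and a shift vector each); the helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm_params(d):
    """LayerNorm has one scale and one shift value per feature."""
    return 2 * d


def transformer_block_params(d_model, d_ff):
    attention = 4 * linear_params(d_model, d_model)       # W_q, W_k, W_v, W_o
    ffn = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    norms = 2 * layer_norm_params(d_model)
    return attention + ffn + norms


# attention: 4 * (512*512 + 512)           = 1,050,624
# ffn:       (512*2048 + 2048) + (2048*512 + 512) = 2,099,712
# norms:     2 * (2 * 512)                 = 2,048
print(f"{transformer_block_params(512, 2048):,}")  # 3,152,384
```

The number of heads does not appear in the count: splitting d_model across 8 heads changes how the projections are reshaped, not how many weights they contain.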
10. Quiz: Test Your Understanding!
After reading the tutorial above, answer the following 5 questions to test your understanding of the Transformer and attention: