Transformer & Attention Mechanism

📋 Table of Contents

Introduction to the Transformer
The Concept of Attention
Self-Attention Mechanism
Multi-Head Attention
Positional Encoding
The Complete Transformer Architecture
BERT: Encoder-Only
GPT: Decoder-Only
Implementing Self-Attention in PyTorch
Comprehension Quiz

1. Introduction to the Transformer

The Transformer is a deep learning architecture introduced by Vaswani et al. at Google in the landmark paper "Attention Is All You Need" (2017). It has largely replaced RNNs and LSTMs for NLP tasks and now underlies most modern AI models, from BERT to ChatGPT and from DALL-E to AlphaFold.

The key innovation of the Transformer is the attention mechanism, which lets the model "look at" all tokens in a sequence at once (in parallel) and determine which ones are most relevant to each position, without processing the sequence step by step as an RNN does.

Why Did the Transformer Replace RNNs?

Diagram: RNN vs Transformer

RNN/LSTM:                              TRANSFORMER:
(Sequential Processing)                 (Parallel Processing)

x₁ → h₁ → x₂ → h₂ → x₃ → h₃        ┌───┐   ┌───┐   ┌───┐
                                        │ x₁│   │ x₂│   │ x₃│
One at a time!                          └─┬─┘   └─┬─┘   └─┬─┘
                                         │╲ ╲   │╲ ╲   │╲ ╲
Problems:                                │ ╲ ╲  │ ╲ ╲  │ ╲ ╲
• Vanishing gradient                     ▼  ▼ ▼ ▼  ▼ ▼ ▼  ▼ ▼ ▼
• Slow (sequential)                     ┌─────┐┌─────┐┌─────┐
• Long-range dependencies                │Self ││Self ││Self │
  are hard to learn                      │Attn ││Attn ││Attn │
                                         └─────┘└─────┘└─────┘

TRANSFORMER ADVANTAGES:
✅ Parallel: all positions are processed at once
✅ Long-range dependencies: attention "sees" distant tokens directly
✅ Scalable: can be trained on large GPU/TPU clusters
✅ General-purpose: NLP, vision, audio, even proteins!

Impact of the Transformer

Domain	Transformer Models	Achievement
NLP	BERT, GPT-4, LLaMA, Gemini	Strong text understanding and generation
Computer Vision	ViT, DeiT, Swin Transformer	Match or outperform CNNs on many tasks
Audio	Whisper, MusicLM	Speech recognition and music generation
Generative AI	DALL-E 3, Stable Diffusion, Sora	Image and video generation from text
Protein	AlphaFold 2	Protein structure prediction (recognized by the 2024 Nobel Prize in Chemistry)
Code	GitHub Copilot, CodeLlama	AI-assisted programming
Robotics	RT-2, Gato	Multi-modal robot control

2. The Concept of Attention

Imagine reading the sentence: "The cat sat on the mat because it felt tired." The word "it" refers to "cat". Attention is the mechanism that lets a model learn that "it" is strongly related to "cat", even when the two words are far apart in the sentence.

Attention in Everyday Life

Diagram: Attention Weight Visualization

Sentence: "The cat sat on the mat because it felt tired"

Attention weights from the word "it":

Word:     The   cat   sat   on    the   mat   because  it    felt  tired
Weight:   0.02  0.65  0.05  0.01  0.02  0.03  0.04     0.02  0.10  0.06
                ↑↑↑
                Highest!

Visualization (thick line = high attention):

"The"     ────────
"cat"     ═══════════════════════════════════════╗
"sat"     ──────────                             ║
"on"      ─────                                  ║
"the"     ──────                                 ║
"mat"     ────────                               ║
"because" ──────────                             ║
"it"      ───────────────────────────────────────╝
"felt"    ──────────────
"tired"   ────────────

→ The model "knows" that "it" refers to "cat"

Evolution of Attention

Era	Mechanism	Description
2014	Seq2Seq + Attention (Bahdanau et al.)	Attention added to an encoder-decoder RNN; the decoder "looks at" the encoder hidden states
2016-2017	Self-Attention / intra-attention (Cheng et al. 2016; Lin et al. 2017)	Tokens within one sequence "look at" each other
2017	Multi-Head Self-Attention (Transformer)	Multiple attention heads run in parallel, with no RNN
2018+	Sparse Attention, Linear Attention	Efficiency: reduces the O(n²) cost toward O(n log n) or O(n)

3. Self-Attention Mechanism

Self-attention is a mechanism in which every token in a sequence computes a relevance score against every token in the same sequence, including itself. This differs from cross-attention, in which one sequence attends to a different sequence.

Query, Key, Value (Q, K, V)

For each token we create three vectors: a Query (Q), a Key (K), and a Value (V).

Diagram: Scaled Dot-Product Attention

Step-by-step Self-Attention:

INPUT: Embedding matrix X (n_tokens × d_model)
       ┌──────────────────────────────────┐
       │  x₁ = [0.2, 0.5, -0.1, ...]     │  Token 1
       │  x₂ = [0.8, -0.3, 0.4, ...]     │  Token 2
       │  x₃ = [-0.5, 0.1, 0.9, ...]     │  Token 3
       │  x₄ = [0.3, 0.7, -0.2, ...]     │  Token 4
       └──────────────────────────────────┘

Step 1: Compute Q, K, V (linear projections)
  Q = X × Wq    (query: "what am I looking for?")
  K = X × Wk    (key:   "what do I offer?")
  V = X × Wv    (value: "the actual information")

Step 2: Compute attention scores
  scores = Q × Kᵀ          (dot product of Q and K)
  scores = scores / √d_k    (scaling: keeps the values from growing too large)

  Q×Kᵀ = ┌──────────────────────┐
          │ s₁₁  s₁₂  s₁₃  s₁₄ │  How relevant
          │ s₂₁  s₂₂  s₂₃  s₂₄ │  each pair of
          │ s₃₁  s₃₂  s₃₃  s₃₄ │  tokens is
          │ s₄₁  s₄₂  s₄₃  s₄₄ │
          └──────────────────────┘

Step 3: Softmax (normalize each row into probabilities)
  attention_weights = softmax(scores)

  ┌──────────────────────────────┐
  │ 0.45  0.20  0.10  0.25      │  ← Token 1 attends most to
  │ 0.15  0.50  0.25  0.10      │     itself (0.45), then to
  │ 0.10  0.30  0.35  0.25      │     Token 4 (0.25)
  │ 0.20  0.10  0.15  0.55      │
  └──────────────────────────────┘

Step 4: Output = Attention_weights × V
  output = softmax(QKᵀ/√d_k) × V
           └────────┬────────┘   └┬┘
           Attention weights     Values

  → Each output token = weighted sum of all value vectors
  → Tokens with higher attention weights contribute more
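
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the helper names (`matmul`, `transpose`, `softmax`, `self_attention`) are hypothetical, the toy input has 3 tokens with d_model = 2, and the projection matrices are set to the identity purely so the numbers are easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # Step 1: linear projections
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    # Step 2: scaled dot-product scores
    scores = matmul(Q, transpose(K))
    scores = [[s / math.sqrt(d_k) for s in row] for row in scores]
    # Step 3: row-wise softmax
    weights = [softmax(row) for row in scores]
    # Step 4: weighted sum of the value vectors
    output = matmul(weights, V)
    return output, weights

# Toy input (assumption): 3 tokens, d_model = 2, identity projections
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(X, I2, I2, I2)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each row of `output` is a mixture of the rows of V. Real models learn Wq, Wk, and Wv during training rather than fixing them.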

Why Scale by √d_k?

Without scaling, a large d_k lets the dot products Q·K grow very large in magnitude. Softmax over such large values yields a very "sharp" distribution (one position receives almost all the weight and the others get roughly 0). In that saturated regime the softmax gradients become very small, which slows training.

📐 Scaled Dot-Product Attention Formula

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

where d_k is the dimension of the key vectors. If the components of Q and K have zero mean and unit variance, their dot product has variance d_k, so dividing by √d_k keeps the variance of the scores at about 1.
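
The variance argument can be checked empirically. The sketch below assumes the components of q and k are independent standard normal samples (an idealization, since real activations only approximate this) and uses a hypothetical helper `dot_product_variance`.

```python
import random
import statistics

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Estimate Var(q . k) for random vectors with N(0, 1) components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

d_k = 64
raw_var = dot_product_variance(d_k)
# Dividing every dot product by sqrt(d_k) divides the variance by d_k
scaled_var = raw_var / d_k

print(f"variance without scaling: {raw_var:.1f}  (theory: {d_k})")
print(f"variance with scaling:    {scaled_var:.2f} (theory: 1)")
```

The unscaled variance grows linearly with d_k, which is why larger heads would saturate softmax more easily without the correction.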

4. Multi-Head Attention

Instead of computing attention once, Multi-Head Attention runs h attention computations in parallel (commonly h = 8 or 12), each with its own Q, K, V projections. The results are concatenated and projected once more.

Why Multi-Head?

Diagram: Multi-Head Attention

SINGLE HEAD:                            MULTI-HEAD (h=4):

Attention from 1 perspective            Attention from 4 different perspectives!

x₁ ──► Q,K,V ──► Attention ──► out     x₁ ─┬► Q₁,K₁,V₁ ──► Attn 1 ──┐
                                           ├► Q₂,K₂,V₂ ──► Attn 2 ──┤
                                           ├► Q₃,K₃,V₃ ──► Attn 3 ──┤ Concat → Linear → out
                                           └► Q₄,K₄,V₄ ──► Attn 4 ──┘

Each head can "look at" a different pattern (illustrative examples):
  Head 1: Syntactic (subject ↔ verb)
  Head 2: Semantic (synonym relations)
  Head 3: Positional (nearby words)
  Head 4: Long-range dependencies

Formula:
  MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W^O
  
  where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)
  
  d_model = 512, h = 8 → d_k = d_v = 512/8 = 64 per head
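
The shape bookkeeping behind d_k = d_model / h can be sketched as follows. This example only illustrates splitting and concatenating heads. It omits the per-head projections and the attention computation itself, and `split_heads` and `concat_heads` are hypothetical helper names.

```python
def split_heads(x, n_heads):
    """Split a (seq_len, d_model) matrix into n_heads matrices of shape (seq_len, d_k)."""
    d_model = len(x[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_model // n_heads
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(n_heads)]

def concat_heads(heads):
    """Concatenate per-head matrices back into a (seq_len, d_model) matrix."""
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

d_model, n_heads, seq_len = 512, 8, 4
# Dummy activations: each entry encodes its own (token, feature) position
x = [[float(t * d_model + j) for j in range(d_model)] for t in range(seq_len)]

heads = split_heads(x, n_heads)
restored = concat_heads(heads)

print(len(heads), len(heads[0]), len(heads[0][0]))  # heads, seq_len, d_k
print(restored == x)                                # splitting then concatenating is lossless
```

Because each head works on d_k = 64 dimensions instead of 512, the total cost of 8 heads is comparable to a single full-width head.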

5. Positional Encoding

Unlike an RNN, which captures order naturally by processing tokens one at a time, the Transformer processes all tokens in parallel. Without position information, the model cannot distinguish "The dog bites the cat" from "The cat bites the dog", even though the meanings are very different.

Sinusoidal Positional Encoding

The original Transformer uses sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Python — Positional Encoding

import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len, d_model):
    """Sinusoidal Positional Encoding."""
    PE = np.zeros((max_len, d_model))
    positions = np.arange(max_len).reshape(-1, 1)
    
    # div_term = 1 / 10000^(2i/d_model) = exp(-2i * log(10000) / d_model)
    # (assumes d_model is even so the sin and cos slices have the same width)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    PE[:, 0::2] = np.sin(positions * div_term)  # Even dimensions
    PE[:, 1::2] = np.cos(positions * div_term)  # Odd dimensions
    
    return PE

# Generate PE
max_len = 100
d_model = 64
pe = positional_encoding(max_len, d_model)

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Full PE heatmap
axes[0, 0].imshow(pe.T, aspect='auto', cmap='RdBu')
axes[0, 0].set_xlabel('Position')
axes[0, 0].set_ylabel('Dimension')
axes[0, 0].set_title('Positional Encoding Heatmap')

# Plot 2: Individual dimensions
for dim in [0, 1, 2, 3, 10, 20]:
    axes[0, 1].plot(pe[:50, dim], label=f'dim {dim}')
axes[0, 1].set_xlabel('Position')
axes[0, 1].set_ylabel('PE Value')
axes[0, 1].set_title('PE Values across Positions')
axes[0, 1].legend(fontsize=8)
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Similarity matrix (cosine similarity between positions)
from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b) + 1e-8)

n_pos = 50
sim_matrix = np.zeros((n_pos, n_pos))
for i in range(n_pos):
    for j in range(n_pos):
        sim_matrix[i, j] = cosine_sim(pe[i], pe[j])

axes[1, 0].imshow(sim_matrix, cmap='viridis')
axes[1, 0].set_xlabel('Position j')
axes[1, 0].set_ylabel('Position i')
axes[1, 0].set_title('Cosine Similarity between Positions')
plt.colorbar(axes[1, 0].images[0], ax=axes[1, 0])

# Plot 4: Info
axes[1, 1].text(0.5, 0.5, 
    f'Positional Encoding\n\n'
    f'Max Length: {max_len}\n'
    f'Model Dimension: {d_model}\n'
    f'Frequency Range: sin & cos\n'
    f'Method: Sinusoidal\n\n'
    f'Properties:\n'
    f'• PE(pos+k) can be written as a\n'
    f'  linear function of PE(pos)\n'
    f'• Each position is unique\n'
    f'• Can be evaluated for sequences\n'
    f'  longer than those seen in training',
    ha='center', va='center', fontsize=11,
    transform=axes[1, 1].transAxes,
    bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.3))
axes[1, 1].axis('off')

plt.suptitle('Positional Encoding — Transformer', fontsize=14)
plt.tight_layout()
plt.show()
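
The property that PE(pos+k) is a linear function of PE(pos) follows from the angle-addition identities: each (sin, cos) pair is rotated by an angle that depends only on k. The sketch below checks this numerically using only the standard library. `pe_vector` and `shift` are hypothetical helper names, and the values of d_model, pos, and k are arbitrary choices.

```python
import math

def pe_vector(pos, d_model):
    """Sinusoidal encoding of one position: [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(0, d_model, 2):           # i plays the role of 2i in the formula
        angle = pos / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

def shift(pe_pos, k, d_model):
    """Compute PE(pos + k) from PE(pos) with one 2x2 rotation per frequency."""
    out = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))   # frequency of this (sin, cos) pair
        s, c = pe_pos[i], pe_pos[i + 1]
        out.append(s * math.cos(w * k) + c * math.sin(w * k))  # sin(a + b)
        out.append(c * math.cos(w * k) - s * math.sin(w * k))  # cos(a + b)
    return out

d_model, pos, k = 64, 7, 5
direct = pe_vector(pos + k, d_model)
shifted = shift(pe_vector(pos, d_model), k, d_model)
max_err = max(abs(a - b) for a, b in zip(direct, shifted))
print(f"max difference: {max_err:.2e}")
```

The rotation coefficients depend on k but not on pos, which is what makes relative offsets easy for the model to represent.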

6. The Complete Transformer Architecture

Encoder-Decoder Architecture

Diagram: Transformer Architecture (Vaswani et al. 2017)

┌─────────────────────────────────────────────────────────────┐
│                     TRANSFORMER                              │
│                                                              │
│  ENCODER (left)              DECODER (right)                │
│  ┌─────────────┐             ┌─────────────────┐            │
│  │ Input       │             │ Output          │            │
│  │ Embedding   │             │ Embedding       │            │
│  │ + Pos Enc   │             │ + Pos Enc       │            │
│  └──────┬──────┘             └───────┬─────────┘            │
│         │                            │                       │
│  ┌──────▼──────┐             ┌───────▼─────────┐            │
│  │ Multi-Head  │             │ Masked Multi-   │            │
│  │ Self-Attn   │             │ Head Self-Attn  │            │
│  └──────┬──────┘             └───────┬─────────┘            │
│         │ + Residual                 │ + Residual            │
│  ┌──────▼──────┐             ┌───────▼─────────┐            │
│  │ Layer Norm  │             │ Layer Norm      │            │
│  └──────┬──────┘             └───────┬─────────┘            │
│         │                            │                       │
│  ┌──────▼──────┐     ┌───────────────▼─────────┐            │
│  │ Feed        │     │ Cross-Attention          │            │
│  │ Forward     │     │ (Q from Decoder,         │            │
│  │ (FFN)       │     │  K,V from Encoder)       │            │
│  └──────┬──────┘     └───────────────┬─────────┘            │
│         │ + Residual                  │ + Residual           │
│  ┌──────▼──────┐             ┌───────▼─────────┐            │
│  │ Layer Norm  │             │ Layer Norm      │            │
│  └──────┬──────┘             └───────┬─────────┘            │
│         │                            │                       │
│    × N layers                   ┌────▼──────────┐            │
│         │                       │ Feed Forward  │            │
│         ▼                       │ (FFN)         │            │
│  Encoder output (K, V)          └────┬──────────┘            │
│  feeds Cross-Attention               │ + Residual            │
│  in every decoder layer         ┌────▼──────────┐            │
│                                 │ Layer Norm    │            │
│                                 └────┬──────────┘            │
│                                      │                       │
│                                 × N layers                  │
│                                      │                       │
│                                 ┌────▼──────────┐            │
│                                 │ Linear +      │            │
│                                 │ Softmax       │            │
│                                 └───────────────┘            │
└─────────────────────────────────────────────────────────────┘

COMPONENTS:
• Input Embedding: token → vector (d_model = 512)
• Positional Encoding: adds position information
• Multi-Head Attention: parallel self-attention heads
• Feed Forward Network: 2 linear layers + ReLU
  FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
• Layer Normalization: stabilizes training
• Residual Connection: x + Sublayer(x)
• Masked Attention (Decoder): prevents "seeing" future tokens

Key Components

Component	Function	Dimensions (d_model=512)
Input Embedding	Token → dense vector	(batch, seq_len, 512)
Positional Encoding	Adds position-order information	(seq_len, 512)
Multi-Head Self-Attention	Every token attends to every token	8 heads × 64 dim
Layer Normalization	Normalizes activations (stabilizes training)	(batch, seq_len, 512)
Feed-Forward Network	Position-wise non-linear transformation	512 → 2048 → 512
Residual Connection	Skip connection: x + Sublayer(x)	(batch, seq_len, 512)
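
The way the residual connection, layer normalization, and FFN fit together can be sketched for a single position. This is an illustrative sketch under stated assumptions: the sizes are toy values (d_model = 4, d_ff = 8), the weights are arbitrary fixed numbers, the layer norm has no learned gain or bias, and the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def post_ln_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual first, then normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Toy sizes and weights (assumptions chosen only for illustration)
d_model, d_ff = 4, 8
W1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model

x = [0.5, -1.0, 2.0, 0.0]
out = post_ln_sublayer(x, lambda v: ffn(v, W1, b1, W2, b2))
print([round(v, 3) for v in out])
```

The output keeps the input dimension, which is what allows N identical layers to be stacked.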

7. BERT: Encoder-Only

BERT (Bidirectional Encoder Representations from Transformers) was introduced by Google in 2018. It uses the encoder-only part of the Transformer and is pre-trained with two tasks: masked language modeling and next sentence prediction.

Pre-training Tasks

Diagram: BERT Pre-training Tasks

1. MASKED LANGUAGE MODELING (MLM):
   Randomly mask 15% of the tokens; the model must predict them.

   Input:    "The [MASK] sat on the [MASK]"
   Target:   "The CAT sat on the MAT"
   
   The model sees context from BOTH directions (bidirectional).

2. NEXT SENTENCE PREDICTION (NSP):
   The model decides whether two sentences are consecutive.

   Input:    [CLS] The cat sat. [SEP] It was fluffy. [SEP]
   Label:    IsNext (consecutive)

   Input:    [CLS] The cat sat. [SEP] Stocks went up. [SEP]
   Label:    NotNext (not consecutive)

BERT VARIANTS:
┌─────────────┬───────────────┬──────────┬──────────┐
│ Model       │ Layers        │ Hidden   │ Params   │
├─────────────┼───────────────┼──────────┼──────────┤
│ BERT-Base   │ 12            │ 768      │ 110M     │
│ BERT-Large  │ 24            │ 1024     │ 340M     │
└─────────────┴───────────────┴──────────┴──────────┘

FINE-TUNING BERT:
Pre-trained BERT → add a task-specific head → fine-tune

[CLS] → BERT → [CLS] embedding → Linear → Sentiment (pos/neg)
Token → BERT → Token embeddings → NER tags
Q + A → BERT → Start/End positions → Question Answering
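
The MLM data preparation can be sketched as below. This is a simplified illustration: it always substitutes `[MASK]` for the selected tokens, whereas the BERT paper replaces only 80% of them with `[MASK]`, swaps 10% for a random token, and leaves 10% unchanged. The function name `mask_tokens` and the example sentence are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide roughly mask_prob of the tokens and remember the originals as targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]     # what the model must predict at position p
        masked[p] = "[MASK]"
    return masked, targets

tokens = "the cat sat on the mat and looked at the bird outside the window today".split()
masked, targets = mask_tokens(tokens)

print(" ".join(masked))
print(targets)
```

The loss is computed only at the masked positions, so the model has to rely on context from both sides of each gap.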

Python: BERT with Hugging Face

from transformers import BertTokenizer, BertModel, BertForSequenceClassification
import torch

# =============================================
# 1. LOAD PRE-TRAINED BERT
# =============================================
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

print(f"Model: BERT-base-uncased")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Max position: {model.config.max_position_embeddings}")

# =============================================
# 2. TOKENIZATION
# =============================================
text = "Transformer is a revolutionary architecture in NLP."
tokens = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

print(f"\nOriginal text: {text}")
print(f"Token IDs: {tokens['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(tokens['input_ids'][0])}")
print(f"Attention mask: {tokens['attention_mask']}")

# BERT tokenization uses WordPiece, which splits words missing from the
# vocabulary into subwords, e.g. "tokenization" → ["token", "##ization"].
# The "##" prefix marks a subword that continues the previous token.

# =============================================
# 3. GET EMBEDDINGS
# =============================================
with torch.no_grad():
    outputs = model(**tokens)

# outputs.last_hidden_state: (batch, seq_len, hidden_size=768)
# outputs.pooler_output: (batch, hidden_size=768), the [CLS] hidden state
# passed through a linear layer + tanh

print(f"\nLast hidden state shape: {outputs.last_hidden_state.shape}")
print(f"Pooler output (CLS) shape: {outputs.pooler_output.shape}")

# =============================================
# 4. FINE-TUNING FOR SENTIMENT ANALYSIS
# =============================================
# BertForSequenceClassification is already imported at the top of this script.

# Load BERT for classification (2 labels: negative/positive).
# The classification head is newly initialized, so it must be trained before use.
model_clf = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)

# Fine-tuning example: one forward pass with labels returns the loss
# that a training loop would minimize (no weights are updated here)
texts = ["This movie is amazing!", "I hated this film."]
labels = torch.tensor([1, 0])  # 1=positive, 0=negative

inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)
outputs = model_clf(**inputs, labels=labels)

print(f"\nFine-tuning Loss: {outputs.loss.item():.4f}")
print(f"Logits shape: {outputs.logits.shape}")

# Prediction
predictions = torch.softmax(outputs.logits, dim=1)
print(f"Predictions: {predictions}")
# Each row is [negative_prob, positive_prob] (index 0 = negative, 1 = positive);
# the values are not meaningful until the classification head has been trained

8. GPT: Decoder-Only

GPT (Generative Pre-trained Transformer) from OpenAI uses the decoder-only part of the Transformer. Unlike BERT, which is bidirectional, GPT is autoregressive: it predicts the next token from the tokens that precede it.
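
Autoregressive generation can be sketched as a loop that repeatedly appends the most likely next token. In this illustration the "model" is a hypothetical hand-written bigram table standing in for a trained network, and the decoding strategy is plain greedy search. A real GPT conditions on the entire prefix and usually samples from the distribution.

```python
# Hypothetical toy "model": next-token probabilities keyed by the last token only
BIGRAM_PROBS = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}

def next_token_distribution(prefix):
    """Return P(next token | prefix); unknown contexts end the sequence."""
    return BIGRAM_PROBS.get(prefix[-1], {"</s>": 1.0})

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: each new token is fed back as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        best = max(dist, key=dist.get)   # pick the most likely next token
        tokens.append(best)
        if best == "</s>":               # stop at the end-of-sequence marker
            break
    return tokens

result = generate(["<s>"])
print(result)  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Each step depends on the output of the previous one, which is why generation is sequential even though training over a full sequence is parallel.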

BERT vs GPT: Key Differences

Aspect	BERT	GPT
Architecture	Encoder-only	Decoder-only
Direction	Bidirectional (sees left and right context)	Unidirectional (sees left context only)
Pre-training	MLM + NSP	Next Token Prediction
Masking	Random mask (15% of tokens)	Causal mask (future tokens hidden)
Best for	Understanding (classification, NER, QA)	Generation (text, code, stories)
Fine-tuning	Task-specific head	In-context learning (prompting) + RLHF
Examples	BERT, RoBERTa, DeBERTa	GPT-4, LLaMA, Mistral, PaLM
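
The causal mask in the table can be illustrated with a small sketch. It assumes uniform raw scores so that only the effect of the mask is visible. `causal_mask` and `masked_softmax` are hypothetical helper names.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out entries (mask == 0) get weight 0."""
    weights = []
    for row, mrow in zip(scores, mask):
        masked = [s if m else float("-inf") for s, m in zip(row, mrow)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores (assumption)
weights = masked_softmax(scores, causal_mask(4))

for row in weights:
    print(row)
```

Row i spreads its weight evenly over positions 0..i and gives exactly zero to every later position, so a token can never use information from its own future.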

Evolution of GPT

Diagram: Evolution of the GPT Models

Evolution of the GPT models (OpenAI):

GPT-1 (2018):
  • 117M parameters, 12 layers
  • Pre-training + fine-tuning paradigm
  • Showed early signs of zero-shot capability

GPT-2 (2019):
  • 1.5B parameters, 48 layers
  • Described as "too dangerous to release" (the full model was released in stages)
  • Multi-task performance without fine-tuning

GPT-3 (2020):
  • 175B parameters, 96 layers
  • In-context learning: few-shot, one-shot, zero-shot
  • Prompt engineering emerges
  • Training cost: ~$4.6 million (third-party estimate)

GPT-3.5 / ChatGPT (2022):
  • Fine-tuned dengan RLHF (Reinforcement Learning from Human Feedback)
  • Conversational AI that went viral
  • Reached an estimated 100 million users in about 2 months

GPT-4 (2023):
  • Multi-modal (text + image input)
  • Massive improvement in reasoning
  • Architecture not officially disclosed (rumored to be a mixture of experts, MoE)

GPT-4o (2024):
  • Omni-modal (text, image, audio, video)
  • Real-time voice conversation
  • Faster & cheaper

Emergent Abilities (appear at large scale):
  • Chain-of-thought reasoning
  • Code generation
  • Mathematical reasoning
  • In-context learning
  • Instruction following

9. Implementing Self-Attention in PyTorch

Python: Multi-Head Self-Attention from Scratch

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class ScaledDotProductAttention(nn.Module):
    """Scaled Dot-Product Attention."""
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k
    
    def forward(self, Q, K, V, mask=None):
        # Q, K, V shape: (batch, heads, seq_len, d_k)
        
        # Step 1: Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # scores shape: (batch, heads, seq_len, seq_len)
        
        # Step 2: Apply mask (used by the decoder to hide future positions)
        # Entries where mask == 0 are set to -inf, so softmax assigns them weight 0
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Step 3: Softmax
        attn_weights = F.softmax(scores, dim=-1)
        
        # Step 4: Weighted sum
        output = torch.matmul(attn_weights, V)
        # output shape: (batch, heads, seq_len, d_k)
        
        return output, attn_weights


class MultiHeadAttention(nn.Module):
    """Multi-Head Attention."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads  # dimension per head
        
        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)  # Query projection
        self.W_k = nn.Linear(d_model, d_model)  # Key projection
        self.W_v = nn.Linear(d_model, d_model)  # Value projection
        self.W_o = nn.Linear(d_model, d_model)  # Output projection
        
        self.attention = ScaledDotProductAttention(self.d_k)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # 1. Linear projection
        Q = self.W_q(Q)  # (batch, seq_len, d_model)
        K = self.W_k(K)
        V = self.W_v(V)
        
        # 2. Split into multiple heads
        # (batch, seq_len, d_model) → (batch, n_heads, seq_len, d_k)
        Q = Q.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # 3. Scaled dot-product attention
        output, attn_weights = self.attention(Q, K, V, mask)
        
        # 4. Concatenate heads
        # (batch, n_heads, seq_len, d_k) → (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        # 5. Final linear projection
        output = self.W_o(output)
        
        return output, attn_weights


class TransformerBlock(nn.Module):
    """Single Transformer Encoder Block."""
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-Head Self-Attention
        self.attention = MultiHeadAttention(d_model, n_heads)
        
        # Feed-Forward Network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        # Layer Normalization
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # 1. Self-Attention + Residual + LayerNorm
        attn_out, attn_weights = self.attention(x, x, x, mask)
        x = self.ln1(x + self.dropout(attn_out))
        
        # 2. FFN + Residual + LayerNorm
        ffn_out = self.ffn(x)
        x = self.ln2(x + self.dropout(ffn_out))
        
        return x, attn_weights


# =============================================
# TEST: Run Multi-Head Attention
# =============================================
print("=" * 60)
print("TESTING MULTI-HEAD SELF-ATTENTION")
print("=" * 60)

# Hyperparameters
d_model = 512
n_heads = 8
d_ff = 2048
seq_len = 10
batch_size = 2

# Create model
block = TransformerBlock(d_model, n_heads, d_ff)

# Random input (simulating token embeddings)
x = torch.randn(batch_size, seq_len, d_model)

# Forward pass
output, attn_weights = block(x)

print(f"Input shape: {x.shape}")             # (2, 10, 512)
print(f"Output shape: {output.shape}")        # (2, 10, 512)
print(f"Attention weights shape: {attn_weights.shape}")  # (2, 8, 10, 10)

# Attention weight visualization
print(f"\nAttention weights (Head 0, Sample 0):")
print(f"Shape: {attn_weights[0, 0].shape} (seq_len × seq_len)")
print(f"Row sum: {attn_weights[0, 0].sum(dim=-1).tolist()}")  # ≈ 1.0 per row

# Parameter count
params = sum(p.numel() for p in block.parameters())
print(f"\nParameters in Transformer block: {params:,}")

# Multi-head attention shapes summary
print("\n=== Shape Summary ===")
print(f"d_model = {d_model}")
print(f"n_heads = {n_heads}")
print(f"d_k = d_v = {d_model // n_heads}")
print(f"Q, K, V per head: ({seq_len}, {d_model // n_heads})")
print(f"Attention matrix: ({seq_len}, {seq_len})")
print(f"Output per head: ({seq_len}, {d_model // n_heads})")
print(f"Concat output: ({seq_len}, {d_model})")
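
The parameter count printed by the test script can be reproduced by hand. The sketch below assumes the layer layout of the `TransformerBlock` above: four d_model × d_model linear layers with biases in the attention module, two linear layers in the FFN, and two layer norms that each hold a gain and a bias vector. `linear_params` and `transformer_block_params` are hypothetical helper names.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def transformer_block_params(d_model, d_ff):
    attention = 4 * linear_params(d_model, d_model)   # W_q, W_k, W_v, W_o
    ffn = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)                   # gain + bias for ln1 and ln2
    return attention + ffn + layer_norms

total = transformer_block_params(512, 2048)
print(f"{total:,}")  # 3,152,384
```

About two thirds of the parameters sit in the feed-forward network, and the number of heads does not change the total because the heads share the same d_model × d_model projections.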

10. Quiz: Test Your Understanding!

After reading the tutorial above, answer the following 5 questions to test your understanding of the Transformer and attention:

Question 1: What is meant by "Self-Attention"?

a) The model pays attention to the user's input

b) Every token in a sequence attends to all tokens in the same sequence

c) The model focuses only on the first token

d) The decoder attends to the encoder output

Question 2: Why does the Transformer use Positional Encoding?

a) To speed up training

b) To reduce the number of parameters

c) Because the Transformer processes all tokens in parallel and otherwise has no notion of order

d) To enable multi-head attention

Question 3: Which architecture does BERT use?

a) Encoder-Decoder (full Transformer)

b) Encoder-only

c) Decoder-only

d) An RNN with attention

Question 4: Why does GPT use a "causal mask" (masked attention)?

a) So the model is faster at inference

b) So the model can only see preceding tokens (autoregressive)

c) To reduce overfitting

d) So the model can handle different languages

Question 5: What is the purpose of the scaling factor √d_k in Scaled Dot-Product Attention?

a) To make the model smaller

b) To keep dot products from growing so large that softmax yields very small gradients

c) To reduce the number of parameters

d) To add non-linearity to the model