RNN: Recurrent Neural Networks — Sequence Modeling & LSTM

📋 Table of Contents

Introduction to Sequence Data & RNN
RNN Architecture in Detail
Backpropagation Through Time (BPTT)
Vanishing & Exploding Gradient Problem
LSTM: Long Short-Term Memory
GRU: Gated Recurrent Unit
Implementing LSTM with PyTorch
Bidirectional & Stacked RNN
Comprehension Quiz

1. Introduction to Sequence Data & RNN

Much real-world data is sequential: the order of the elements carries meaning. A Recurrent Neural Network (RNN) is a neural network architecture designed specifically to process sequential data by keeping a "memory" of previous steps.

What Is Sequence Data?

Data Type	Example	ML Task
Text / Natural Language	Sentences, articles, chat	Translation, sentiment analysis, text generation
Time Series	Stock prices, weather, sensors	Forecasting, anomaly detection
Audio / Speech	Voice recordings, music	Speech recognition, music generation
Video	Video clips as ordered frames	Action recognition, video captioning
Biological Sequences	DNA, proteins	Gene prediction, protein structure

Why Not a Plain MLP?

Diagram: MLP vs RNN for Sequential Data

  MLP (Feedforward)                RNN (Recurrent)
  ┌───────────────────────┐       ┌───────────────────────┐
  │                       │       │                       │
  │ Input: "I like"       │       │ Input: "I like"       │
  │         ↓             │       │         ↓             │
  │ [x₁][x₂][x₃][x₄]      │       │[t=1]→[t=2]→[t=3]→[t=4]│
  │         ↓             │       │   ↻      ↻      ↻     │
  │  Hidden layers        │       │  (state flows from    │
  │         ↓             │       │   one step to next)   │
  │  Output               │       │         ↓             │
  │                       │       │  Output per timestep  │
  │ ✗ Each input is       │       │                       │
  │   independent         │       │ ✓ "Remembers" context │
  │ ✗ No memory           │       │   from previous       │
  │ ✗ Fixed input length  │       │   steps               │
  └───────────────────────┘       └───────────────────────┘

2. RNN Architecture in Detail

An RNN has a recurrent (loop) connection that lets information from the previous timestep flow into the next one. This gives the RNN a kind of short-term "memory".

Unrolling RNN

Diagram: RNN Unrolled

  RNN (Recurrent)                   RNN (Unrolled)
  ┌──────────┐                     ┌──────────┐  ┌──────────┐  ┌──────────┐
  │          │                     │          │  │          │  │          │
  │   ┌───┐  │                     │   ┌───┐  │  │   ┌───┐  │  │   ┌───┐  │
  │   │ h │──┼──↻                  │   │ h₁│  │  │   │ h₂│  │  │   │ h₃│  │
  │   └─┬─┘  │                     │   └─┬─┘  │  │   └─┬─┘  │  │   └─┬─┘  │
  │     │    │                     │     │    │  │     │    │  │     │    │
  │     │    │                     │     │    │  │     │    │  │     │    │
  │   ┌─┴─┐  │                     │   ┌─┴─┐  │  │   ┌─┴─┐  │  │   ┌─┴─┐  │
  │   │ x │  │                     │   │ x₁│  │  │   │ x₂│  │  │   │ x₃│  │
  │   └───┘  │                     │   └───┘  │  │   └───┘  │  │   └───┘  │
  │     │    │                     │     │    │  │     │    │  │     │    │
  └─────┼────┘                     └─────┼────┘  └─────┼────┘  └─────┼────┘
        │                               │            │            │
        y                               y₁           y₂           y₃
  
  Equations:                        t=1: h₁ = tanh(Wₓₕ·x₁ + Wₕₕ·h₀ + bₕ)
  h(t) = tanh(Wₓₕ·x(t) +            t=2: h₂ = tanh(Wₓₕ·x₂ + Wₕₕ·h₁ + bₕ)
         Wₕₕ·h(t-1) + bₕ)           t=3: h₃ = tanh(Wₓₕ·x₃ + Wₕₕ·h₂ + bₕ)
  y(t) = Wₕᵧ·h(t) + bᵧ
  
  Where:
  x(t) = input at timestep t
  h(t) = hidden state at timestep t
  y(t) = output at timestep t
  Wₓₕ  = input-to-hidden weights
  Wₕₕ  = hidden-to-hidden weights (the recurrent connection)
  Wₕᵧ  = hidden-to-output weights
  bₕ,bᵧ = biases
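
The recurrence above can be checked numerically. The following is a minimal sketch, assuming PyTorch: it applies h(t) = tanh(Wₓₕ·x(t) + Wₕₕ·h(t-1) + bₕ) by hand with the weights of an `nn.RNN` layer and compares the result with the layer's own output. PyTorch stores two bias vectors (`bias_ih_l0` and `bias_hh_l0`); their sum plays the role of bₕ here. The sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 3, 4, 5  # arbitrary illustrative sizes

rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # tanh activation by default
x = torch.randn(1, seq_len, input_size)                  # (batch, seq, features)

W_xh = rnn.weight_ih_l0                  # (hidden, input)
W_hh = rnn.weight_hh_l0                  # (hidden, hidden)
b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0    # PyTorch keeps two biases; the sum acts as b_h

h = torch.zeros(hidden_size)             # h(0)
manual_states = []
with torch.no_grad():
    for t in range(seq_len):
        h = torch.tanh(W_xh @ x[0, t] + W_hh @ h + b_h)
        manual_states.append(h)
    manual = torch.stack(manual_states)  # (seq, hidden)
    reference, h_last = rnn(x)           # reference: (1, seq, hidden)

print(torch.allclose(manual, reference[0], atol=1e-6))
```

Note that `nn.RNN` only produces the hidden states; the output projection y(t) = Wₕᵧ·h(t) + bᵧ would be a separate `nn.Linear` layer applied on top.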

RNN Operating Modes

Diagram: RNN Operating Modes

  1. One-to-One         2. One-to-Many        3. Many-to-One
  ┌────────┐            ┌────────────┐        ┌────────────┐
  │   x    │            │     x      │        │ x₁ x₂ … xₙ │
  │   ↓    │            │     ↓      │        │     ↓      │
  │   y    │            │ y₁ y₂ … yₙ │        │     y      │
  └────────┘            └────────────┘        └────────────┘
  Single-image          Image captioning      Sentiment
  classification        "A cat sitting"       analysis
  
  4. Many-to-Many       5. Seq-to-Seq (Encoder-Decoder)
  ┌────────────┐        ┌──────────────┐  ┌──────────────┐
  │ x₁ x₂ … xₙ │        │ Encoder      │  │ Decoder      │
  │ ↓  ↓    ↓  │        │ x₁→x₂→..→xₙ  │→ │ →y₁→y₂→..→yₘ │
  │ y₁ y₂ … yₙ │        │    [context] │  │              │
  └────────────┘        └──────────────┘  └──────────────┘
  POS tagging           Machine translation
                        (Indonesian → English; output length m may differ from n)
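
In code, the difference between many-to-one and many-to-many often comes down to which hidden states are passed to the output layer. A minimal sketch, assuming PyTorch and arbitrary sizes (the names `head`, `many_to_one`, and `many_to_many` are illustrative only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, n_features, hidden, n_classes = 2, 6, 8, 16, 3  # illustrative sizes

rnn = nn.RNN(n_features, hidden, batch_first=True)
head = nn.Linear(hidden, n_classes)      # output layer shared by both modes
x = torch.randn(batch, seq_len, n_features)

outputs, h_n = rnn(x)                    # outputs: (batch, seq_len, hidden)

# Many-to-one (e.g. sentiment analysis): use only the last hidden state
many_to_one = head(outputs[:, -1, :])    # (batch, n_classes)

# Many-to-many (e.g. POS tagging): apply the head at every timestep
many_to_many = head(outputs)             # (batch, seq_len, n_classes)

print(many_to_one.shape, many_to_many.shape)
```

For a single-layer, unidirectional RNN without padding, `outputs[:, -1, :]` and `h_n[0]` hold the same values.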

3. Backpropagation Through Time (BPTT)

RNNs are trained with Backpropagation Through Time (BPTT): a version of backpropagation that unrolls the RNN across time and, because the same weights are reused at every step, accumulates the gradient contributions of all timesteps.

Diagram: BPTT

  Forward Pass (top to bottom)    Backward Pass (gradients flow backward)
  
  t=1: x₁ → h₁ → y₁ → L₁          t=3: ∂L/∂h₃ = ∂L₃/∂h₃
         ↓                        t=2: ∂L/∂h₂ = ∂L₂/∂h₂ + ∂h₃/∂h₂·∂L/∂h₃
  t=2: x₂ → h₂ → y₂ → L₂          t=1: ∂L/∂h₁ = ∂L₁/∂h₁ + ∂h₂/∂h₁·∂L/∂h₂
         ↓                        
  t=3: x₃ → h₃ → y₃ → L₃          Total gradient: ∂L/∂W = Σₜ ∂Lₜ/∂W
         
  Total Loss L = L₁ + L₂ + L₃     Gradients flow backward through
                                  all timesteps and are accumulated
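
The statement that the total gradient is the sum of the per-timestep contributions can be verified with autograd. A small sketch, assuming PyTorch, a squared-error loss at each timestep, and arbitrary sizes (the helper `grad_of` is illustrative, not part of any library):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=2, hidden_size=3, batch_first=True)
x = torch.randn(1, 3, 2)          # one sequence with 3 timesteps
target = torch.randn(1, 3, 3)

def grad_of(loss):
    """Gradient of `loss` with respect to the shared recurrent weights W_hh."""
    return torch.autograd.grad(loss, rnn.weight_hh_l0, retain_graph=True)[0]

out, _ = rnn(x)
# L1, L2, L3: one loss term per timestep
per_step_losses = [((out[:, t] - target[:, t]) ** 2).sum() for t in range(3)]
total_loss = sum(per_step_losses)

grad_total = grad_of(total_loss)                       # dL/dW_hh
grad_sum = sum(grad_of(l) for l in per_step_losses)    # sum over t of dL_t/dW_hh

print(torch.allclose(grad_total, grad_sum, atol=1e-6))
```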

4. Vanishing & Exploding Gradient Problem

The biggest problem with vanilla RNNs is gradient instability during BPTT: as gradients are backpropagated through many timesteps, they can shrink toward zero (vanishing) or grow without bound (exploding).

Diagram: Vanishing vs Exploding Gradient

  VANISHING GRADIENT              EXPLODING GRADIENT
  
  Gradient                       Gradient
  │                              │                          ★
  │★★★★                          │                         ★
  │      ★★★                     │                        ★
  │          ★★★                 │                       ★
  │             ★★★              │                      ★
  │                ★★★★          │                  ★★★
  │                    ★★★★★★★   │  ★★★★★★★★★★★★
  │                              │
  └──── steps backpropagated →   └──── steps backpropagated →
    (0)                  (T)       (0)                  (T)
  
  Gradients shrink → the model     Gradients grow →
  CANNOT learn from early          training becomes unstable,
  timesteps (long-range context)   weights become NaN/Inf
  
  Cause:                           Cause:
  ∂hₜ/∂hₖ = Π ∂hⱼ/∂hⱼ₋₁            Recurrent weight matrix Wₕₕ
  If ||∂hⱼ/∂hⱼ₋₁|| < 1 →           with eigenvalue magnitude > 1
  repeated products → approach 0   (the same product then grows)

  REMEDIES: LSTM and GRU for vanishing gradients, gradient clipping for exploding gradients
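
The effect of the repeated product can be reproduced in a few lines. The sketch below makes a simplifying assumption: it uses a linear recurrence (no tanh), so that ∂hⱼ/∂hⱼ₋₁ = Wₕₕ exactly, and it builds Wₕₕ as a scaled orthogonal matrix so that every step multiplies the gradient norm by the same factor. With tanh, each Jacobian carries an extra factor 1 − h² ≤ 1, which can only shrink the gradient further.

```python
import torch

torch.manual_seed(0)
hidden, steps = 16, 30   # illustrative sizes

def gradient_norms(scale):
    """Norm of a unit gradient after each step back through a linear RNN.

    W_hh = scale * Q with Q orthogonal, so every singular value equals `scale`
    and each step multiplies the gradient norm by exactly `scale`.
    """
    q, _ = torch.linalg.qr(torch.randn(hidden, hidden, dtype=torch.float64))
    W_hh = scale * q
    grad = torch.zeros(hidden, dtype=torch.float64)
    grad[0] = 1.0                       # unit-norm gradient at the last timestep
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad            # multiply by (dh(j)/dh(j-1))^T = W_hh^T
        norms.append(grad.norm().item())
    return norms

vanishing = gradient_norms(0.5)
exploding = gradient_norms(1.5)
print(f"scale 0.5 after {steps} steps: {vanishing[-1]:.3e}")   # about 0.5**30
print(f"scale 1.5 after {steps} steps: {exploding[-1]:.3e}")   # about 1.5**30
```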

Solutions to the Gradient Problems

Solution	Problem	Mechanism
Gradient Clipping	Exploding gradient	Rescale the gradient when its norm exceeds a threshold: `g = g * threshold/‖g‖`
LSTM / GRU	Vanishing gradient	Gating mechanisms that control the flow of information
Residual Connections	Vanishing gradient	Skip connections, as in ResNet
Proper Initialization	Both	Orthogonal initialization, Xavier/He init
Layer Normalization	Both	Stabilizes training
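
The clipping rule in the table can be written out by hand and compared with PyTorch's helper. A minimal sketch, assuming a toy `nn.Linear` layer and an artificially scaled loss so that the gradient norm is well above the threshold:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 4)
loss = (layer(torch.randn(8, 4)) * 100).pow(2).sum()   # scaled up to get a large gradient
loss.backward()

threshold = 1.0
grads = [p.grad.clone() for p in layer.parameters()]

# Global L2 norm over all parameters, then g = g * threshold / ||g|| if ||g|| > threshold
total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
scale = min(1.0, threshold / total_norm.item())
manual = [g * scale for g in grads]

# Library version: clips p.grad in place and returns the norm measured before clipping
returned_norm = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=threshold)

clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))
print(f"before: {returned_norm.item():.2f}, after: {clipped_norm.item():.4f}")
```

The norm is computed over all parameters together, so clipping rescales the whole gradient vector and preserves its direction.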

Python — Gradient Clipping in PyTorch

import torch
import torch.nn as nn

# === GRADIENT CLIPPING ===
model = nn.RNN(input_size=10, hidden_size=64, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.MSELoss()

# Dummy data for illustration only.
# nn.RNN expects (seq_len, batch, features) unless batch_first=True.
input_data = torch.randn(20, 8, 10)
target = torch.randn(20, 8, 64)

# Order matters: clip after the backward pass and before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()

# Example training loop with gradient clipping
for epoch in range(10):
    # Forward pass
    output, hidden = model(input_data)
    loss = loss_function(output, target)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    
    # CLIP gradients (guards against exploding gradients);
    # the return value is the total gradient norm before clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), max_norm=5.0
    )
    
    # Update weights
    optimizer.step()
    
    if epoch % 2 == 0:
        print(f"Epoch {epoch}: Loss={loss.item():.4f}, Grad Norm={grad_norm:.4f}")

5. LSTM: Long Short-Term Memory

LSTM (Hochreiter & Schmidhuber, 1997) was designed to address the vanishing gradient problem. In its now-standard form it uses a cell state (a memory pathway) and three gates that control the flow of information: the forget gate, the input gate, and the output gate.

LSTM Architecture in Detail

Diagram: LSTM Cell in Detail

                    ┌─────────────────────────────────────────────────┐
                    │                  LSTM CELL                       │
                    │                                                  │
  C(t-1) ──────────┼──► [×]──────────────────► [+] ──────────────────┼──► C(t)
        │          │   ↑  forget gate          ↑  input gate          │
        │          │   │                       │                      │
        │          │ ┌─┴──────────┐          ┌─┴──────────┐          │
        │          │ │  f(t) = σ  │          │  i(t) = σ  │          │
        │          │ │ (Wf·[h,x]  │          │ (Wi·[h,x]  │          │
        │          │ │    + bf)   │          │    + bi)   │          │
        │          │ └────────────┘          └────────────┘          │
        │          │                                      ↓          │
        │          │                              ┌──────────────┐   │
        │          │                              │ C̃(t) = tanh │   │
        │          │                              │(Wc·[h,x]+bc)│   │
        │          │                              └──────────────┘   │
        │          │                                                  │
        │          │ C(t) ──────► tanh ──► [×] ──────────────────────┼──► h(t)
        │          │              ↑    output gate                    │
        │          │              │                                   │
        │          │            ┌─┴──────────┐                       │
        │          │            │  o(t) = σ  │                       │
        │          │            │ (Wo·[h,x]  │                       │
        │          │            │    + bo)   │                       │
  h(t-1)──────────┼──►         └────────────┘                        │
                    │                                                  │
  x(t) ────────────┼──►                                               │
                    └─────────────────────────────────────────────────┘

  LSTM equations:
  ┌──────────────────────────────────────────────────────┐
  │ f(t) = σ(Wf · [h(t-1), x(t)] + bf)   ← Forget Gate │
  │ i(t) = σ(Wi · [h(t-1), x(t)] + bi)   ← Input Gate  │
  │ C̃(t) = tanh(Wc · [h(t-1), x(t)] + bc) ← Candidate  │
  │ C(t) = f(t) ⊙ C(t-1) + i(t) ⊙ C̃(t)  ← Cell State  │
  │ o(t) = σ(Wo · [h(t-1), x(t)] + bo)   ← Output Gate │
  │ h(t) = o(t) ⊙ tanh(C(t))              ← Hidden State│
  └──────────────────────────────────────────────────────┘
  σ = sigmoid, ⊙ = element-wise multiplication
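
The six equations can be traced step by step against PyTorch's `nn.LSTMCell`. This is a sketch under two assumptions about PyTorch's parameter layout: the weights acting on x(t) and h(t-1) are stored separately (`weight_ih`, `weight_hh`) rather than as one matrix on the concatenation [h(t-1), x(t)], and the four blocks are stacked in the order input, forget, candidate, output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 3, 4           # illustrative sizes
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(1, input_size)           # x(t)
h_prev = torch.randn(1, hidden_size)     # h(t-1)
c_prev = torch.randn(1, hidden_size)     # C(t-1)

with torch.no_grad():
    # All four pre-activations at once, then split into the four blocks
    pre = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh
    i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

    f = torch.sigmoid(f_pre)             # forget gate
    i = torch.sigmoid(i_pre)             # input gate
    c_tilde = torch.tanh(g_pre)          # candidate
    c = f * c_prev + i * c_tilde         # cell state
    o = torch.sigmoid(o_pre)             # output gate
    h = o * torch.tanh(c)                # hidden state

    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h, h_ref, atol=1e-6), torch.allclose(c, c_ref, atol=1e-6))
```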

What Each Gate Does

Gate	Formula	Function	Analogy
Forget Gate (f)	σ(Wf·[h,x] + bf)	Decides which information in the cell state is DISCARDED	Erasing old memories that are no longer relevant
Input Gate (i)	σ(Wi·[h,x] + bi)	Decides which new information is STORED	Writing down important new memories
Cell Candidate (C̃)	tanh(Wc·[h,x] + bc)	Proposes candidate values to add to the cell state	Drafting a new note
Output Gate (o)	σ(Wo·[h,x] + bo)	Decides which parts of the cell state become the output	Choosing what to "say out loud"

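💡 Why Does LSTM Mitigate Vanishing Gradients?

The cell state is updated additively: C(t) = f(t) ⊙ C(t-1) + i(t) ⊙ C̃(t). Along this path the gradient is only scaled element-wise by the forget gate; it is not repeatedly multiplied by a weight matrix and a tanh derivative as in a vanilla RNN. When the forget gate stays close to 1, gradients pass through many timesteps almost unchanged, so the LSTM can "remember" long-range information. This mitigates vanishing gradients rather than eliminating them: a forget gate well below 1 still shrinks the gradient over time.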
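
A short derivation makes the contrast concrete. It is a sketch that treats the gates as constants with respect to C(t-1); this is an approximation, since the gates also depend on C(t-1) indirectly through h(t-1).

```latex
% Vanilla RNN: each step multiplies by a full matrix and a tanh derivative
\frac{\partial h(j)}{\partial h(j-1)} = \operatorname{diag}\!\big(1 - h(j)^2\big)\, W_{hh}

% LSTM cell-state path: each step multiplies element-wise by the forget gate
\frac{\partial C(t)}{\partial C(t-1)} \approx \operatorname{diag}\!\big(f(t)\big)
\quad\Longrightarrow\quad
\frac{\partial C(t)}{\partial C(k)} \approx \prod_{j=k+1}^{t} \operatorname{diag}\!\big(f(j)\big)
```

For example, a forget gate held at 0.99 over 100 steps leaves a factor of 0.99^100 ≈ 0.37, while a factor of 0.5 per step would leave 0.5^100 ≈ 8 × 10⁻³¹.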

6. GRU: Gated Recurrent Unit

GRU (Cho et al., 2014) is a simplification of LSTM that merges the forget and input gates into a single update gate and drops the separate cell state. GRU is lighter and often performs on par with LSTM.

GRU vs LSTM

Diagram: LSTM vs GRU

  LSTM (3 gates + cell state)       GRU (2 gates, no cell state)
  ┌─────────────────────────┐       ┌─────────────────────────┐
  │                         │       │                         │
  │ Forget Gate → f(t)      │       │                         │
  │ Input Gate  → i(t)      │       │ Update Gate → z(t)      │
  │ Output Gate → o(t)      │       │ Reset Gate  → r(t)      │
  │ Cell State  → C(t)      │       │ Hidden State → h(t)     │
  │ Hidden State → h(t)     │       │                         │
  │                         │       │                         │
  │ Params: 4 × (n²+nm+n)   │       │ Params: 3 × (n²+nm+n)   │
  └─────────────────────────┘       └─────────────────────────┘

  GRU equations:
  ┌─────────────────────────────────────────────────────────┐
  │ z(t) = σ(Wz · [h(t-1), x(t)] + bz)  ← Update Gate     │
  │ r(t) = σ(Wr · [h(t-1), x(t)] + br)  ← Reset Gate      │
  │ h̃(t) = tanh(W · [r(t)⊙h(t-1), x(t)] + b) ← Candidate │
  │ h(t) = (1 - z(t)) ⊙ h(t-1) + z(t) ⊙ h̃(t)  ← Output  │
  └─────────────────────────────────────────────────────────┘
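
The four GRU equations translate almost line for line into code. The sketch below follows the equations exactly as written above; the parameter names (`W_z`, `W_r`, `W_h` and the biases) and the random initialization are illustrative assumptions. PyTorch's `nn.GRU` uses a slightly different but equivalent-in-spirit formulation (the reset gate is applied after the hidden-state matrix product, and the roles of z and 1 − z are swapped), so this code is not expected to match it weight for weight.

```python
import torch

torch.manual_seed(0)
n, m = 4, 3   # hidden size and input size (illustrative)

# Hypothetical parameters; each W acts on the concatenation [h, x]
W_z, W_r, W_h = (torch.randn(n, n + m) * 0.5 for _ in range(3))
b_z, b_r, b_h = (torch.zeros(n) for _ in range(3))

def gru_step(x, h_prev):
    """One GRU step written as in the equations above."""
    hx = torch.cat([h_prev, x])
    z = torch.sigmoid(W_z @ hx + b_z)                              # update gate
    r = torch.sigmoid(W_r @ hx + b_r)                              # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x]) + b_h)   # candidate
    h = (1 - z) * h_prev + z * h_tilde                             # blend old state and candidate
    return h, z, h_tilde

h_prev = torch.randn(n)
x = torch.randn(m)
h, z, h_tilde = gru_step(x, h_prev)
print(h)
```

Because z lies in (0, 1), every element of the new state is a convex combination of the old state and the candidate.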

Aspect	LSTM	GRU
Gates	3 (forget, input, output)	2 (update, reset)
Cell State	Yes (separate from the hidden state)	No
Parameters	More (~4× a vanilla RNN layer; n = hidden size, m = input size)	Fewer (~3× a vanilla RNN layer)
Training Speed	Slower	Faster
Long-Term Memory	Often better on very long sequences	Usually good, but sometimes falls short
When to Use?	Complex data, long sequences	Limited data, speed matters
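
The 4:3 parameter ratio can be confirmed directly. A quick sketch, assuming PyTorch and arbitrary sizes; note that PyTorch keeps two bias vectors per block, so its count per block is n² + nm + 2n rather than n² + nm + n.

```python
import torch.nn as nn

n, m = 256, 128   # hidden size and input size (illustrative)

def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

rnn_params = count_params(nn.RNN(m, n))
lstm_params = count_params(nn.LSTM(m, n))
gru_params = count_params(nn.GRU(m, n))

# PyTorch keeps two bias vectors per block, hence 2n instead of n
per_block = n * n + n * m + 2 * n
print(rnn_params, lstm_params, gru_params)
```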

7. Implementing LSTM with PyTorch

Example 1: LSTM for Sentiment Classification

Python — LSTM Text Classification with PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader, TensorDataset

# === LSTM MODEL DEFINITION ===
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 output_dim, n_layers, dropout):
        super().__init__()
        
        # Layers
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,       # Input shape: (batch, seq_len, features)
            dropout=dropout if n_layers > 1 else 0,
            bidirectional=False
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, text, text_lengths):
        # text shape: (batch_size, seq_len)
        
        # 1. Embedding
        embedded = self.dropout(self.embedding(text))
        # embedded: (batch_size, seq_len, embed_dim)
        
        # 2. Pack padded sequences (so the LSTM skips padding tokens)
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, text_lengths.cpu(),
            batch_first=True, enforce_sorted=False
        )
        
        # 3. LSTM forward pass
        packed_output, (hidden, cell) = self.lstm(packed)
        # hidden: (n_layers * n_directions, batch, hidden_dim)
        
        # 4. Take the hidden state of the last layer
        # hidden[-1] is the top layer for a unidirectional LSTM
        hidden = self.dropout(hidden[-1])
        # hidden: (batch_size, hidden_dim)
        
        # 5. Fully connected layer
        output = self.fc(hidden)
        # output: (batch_size, output_dim)
        
        return output

# === HYPERPARAMETERS ===
VOCAB_SIZE = 10000
EMBED_DIM = 128
HIDDEN_DIM = 256
OUTPUT_DIM = 2  # Binary classification (positive/negative)
N_LAYERS = 2
DROPOUT = 0.5
LEARNING_RATE = 0.001
BATCH_SIZE = 64
EPOCHS = 10

# === MODEL INITIALIZATION ===
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMClassifier(
    VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, DROPOUT
).to(device)

print(f"Model Architecture:\n{model}")
print(f"\nTotal Parameters: {sum(p.numel() for p in model.parameters()):,}")

# === TRAINING LOOP ===
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

def train_epoch(model, dataloader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_text, batch_lengths, batch_labels in dataloader:
        batch_text = batch_text.to(device)
        batch_labels = batch_labels.to(device)
        
        # Forward
        predictions = model(batch_text, batch_lengths)
        loss = criterion(predictions, batch_labels)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping (guards against exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        
        optimizer.step()
        
        # Metrics
        total_loss += loss.item()
        predicted = predictions.argmax(dim=1)
        correct += (predicted == batch_labels).sum().item()
        total += batch_labels.size(0)
    
    return total_loss / len(dataloader), correct / total

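# Simulated training (replace with a real data loader and a call to train_epoch).
# The numbers printed below are synthetic placeholders, not real results.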
print("\nTraining LSTM Text Classifier...")
for epoch in range(EPOCHS):
    # Simulated metrics
    train_loss = 0.65 * (0.9 ** epoch) + np.random.normal(0, 0.01)
    train_acc = 0.60 + 0.04 * epoch + np.random.normal(0, 0.01)
    print(f"Epoch {epoch+1:2d}/{EPOCHS} | Loss: {train_loss:.4f} | Acc: {train_acc:.4f}")
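
The training loop above expects each batch to be a triple of padded token ids, true sequence lengths, and labels. One way to produce such batches is a custom `collate_fn`. The following is a sketch with hypothetical toy data; `samples` and `collate_batch` are illustrative names, and token id 0 is reserved for padding to match `padding_idx=0` in the model.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy data: token-id sequences of different lengths with binary labels
samples = [
    (torch.tensor([5, 8, 2]), 1),
    (torch.tensor([7, 3, 9, 4, 6]), 0),
    (torch.tensor([1, 2]), 1),
    (torch.tensor([4, 4, 4, 4]), 0),
]

def collate_batch(batch):
    """Pad sequences with 0 and return them with their true lengths and labels."""
    texts, labels = zip(*batch)
    lengths = torch.tensor([len(t) for t in texts])
    padded = pad_sequence(list(texts), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=collate_batch)
batch_text, batch_lengths, batch_labels = next(iter(loader))
print(batch_text)
print(batch_lengths, batch_labels)
```

A loader built this way could be passed as `dataloader` to `train_epoch`; the lengths stay on the CPU, which is what `pack_padded_sequence` requires.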

Example 2: LSTM for Time Series Forecasting

Python — LSTM Time Series Prediction

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# === GENERATE TIME SERIES DATA ===
np.random.seed(42)
t = np.linspace(0, 100, 1000)
series = np.sin(0.1 * t) + 0.5 * np.sin(0.05 * t) + np.random.normal(0, 0.1, len(t))

# Normalize
mean, std = series.mean(), series.std()
series_norm = (series - mean) / std

# Create sequences
def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        xs.append(data[i:i+seq_length])
        ys.append(data[i+seq_length])
    return np.array(xs), np.array(ys)

SEQ_LENGTH = 50
X, y = create_sequences(series_norm, SEQ_LENGTH)

# Split train/test
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Convert to tensors
X_train_t = torch.FloatTensor(X_train).unsqueeze(-1)  # (batch, seq, 1)
y_train_t = torch.FloatTensor(y_train)
X_test_t = torch.FloatTensor(X_test).unsqueeze(-1)
y_test_t = torch.FloatTensor(y_test)

# === LSTM MODEL ===
class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                           batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, 1)
    
    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # Take the last timestep
        return output.squeeze(-1)  # (batch,); squeeze(-1) keeps the batch dim when batch size is 1

model = LSTMForecaster()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training
print("Training LSTM Forecaster...")
losses = []
for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    pred = model(X_train_t)
    loss = criterion(pred, y_train_t)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    losses.append(loss.item())
    
    if (epoch + 1) % 10 == 0:
        model.eval()
        with torch.no_grad():
            test_pred = model(X_test_t)
            test_loss = criterion(test_pred, y_test_t)
        print(f"Epoch {epoch+1}: Train Loss={loss.item():.4f}, Test Loss={test_loss.item():.4f}")

# Visualization
model.eval()
with torch.no_grad():
    predictions = model(X_test_t).numpy()

y_test_actual = y_test * std + mean
predictions_actual = predictions * std + mean

plt.figure(figsize=(14, 5))
plt.plot(y_test_actual[:200], label='Actual', alpha=0.8)
plt.plot(predictions_actual[:200], label='Predicted', alpha=0.8)
plt.title('LSTM Time Series Forecasting')
plt.xlabel('Time Step')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
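
The model above predicts one step ahead from a window of true values. To forecast several steps into the future, a common approach is to feed each prediction back into the input window. The sketch below is a generic helper under that assumption; `rollout` and `toy_model` are illustrative names, and `toy_model` is a stand-in used only to make the example self-contained.

```python
import torch

def rollout(model, window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window.

    model:  maps a (1, seq_len, 1) tensor to a one-step prediction
    window: 1-D tensor holding the last seq_len observed (normalized) values
    """
    window = window.clone()
    forecasts = []
    with torch.no_grad():
        for _ in range(n_steps):
            next_value = model(window.view(1, -1, 1)).reshape(())   # scalar tensor
            forecasts.append(next_value.item())
            window = torch.cat([window[1:], next_value.view(1)])    # slide the window
    return forecasts

# Stand-in model for illustration only: predicts "last value + 1"
def toy_model(x):
    return x[:, -1, 0] + 1.0

print(rollout(toy_model, torch.tensor([0.0, 1.0, 2.0]), n_steps=3))
```

With a trained forecaster, the same helper could be called as `rollout(model, X_test_t[0, :, 0], n_steps=20)` after `model.eval()`. Errors compound in this setting, so multi-step forecasts are usually less accurate than one-step predictions.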

8. Bidirectional & Stacked RNN

Bidirectional RNN

A bidirectional RNN runs two RNNs over the same sequence: one from left to right (forward) and one from right to left (backward). Their outputs are concatenated, so the model can use context from both directions.

Diagram: Bidirectional RNN

  Unidirectional RNN              Bidirectional RNN
  
  x₁ → x₂ → x₃ → x₄            x₁ → x₂ → x₃ → x₄
  ↓     ↓     ↓     ↓             ↓     ↓     ↓     ↓
  h₁→  h₂→  h₃→  h₄→            h₁→   h₂→  h₃→  h₄→   (forward)
                                  h₁←   h₂←  h₃←  h₄←   (backward)
                                  ↓     ↓     ↓     ↓
                                  [h₁→;h₁←] [h₂→;h₂←]... (concat)
                                      ↓
                                    Output
  
  Sees only the PAST              Sees PAST + FUTURE
  → suits time series             → suits NLP (NER, POS tagging)
    forecasting                     where two-way context matters
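
The concatenation is visible in the tensor shapes that PyTorch returns. A small sketch, assuming `nn.LSTM` with `bidirectional=True` and arbitrary sizes: the first `hidden` features of each output vector come from the forward pass and the remaining ones from the backward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 5
lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True, bidirectional=True)
x = torch.randn(2, 7, 3)                       # (batch, seq_len, features)

output, (h_n, c_n) = lstm(x)
print(output.shape)                            # (batch, seq_len, 2 * hidden)
print(h_n.shape)                               # (2 directions, batch, hidden)

forward_last = output[:, -1, :hidden]          # forward pass after reading x1..x7
backward_last = output[:, 0, hidden:]          # backward pass after reading x7..x1
print(torch.allclose(forward_last, h_n[0]), torch.allclose(backward_last, h_n[1]))
```

The backward direction finishes at the first position, which is why its final state is found at `output[:, 0]` and not at `output[:, -1]`.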

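Stacked (Deep) RNN

A stacked RNN places several recurrent layers on top of one another: the hidden-state sequence produced by one layer becomes the input sequence of the next. In PyTorch this is set with num_layers; the example below combines it with bidirectional=True.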

Python — Bidirectional Stacked LSTM

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bidirectional LSTM for Named Entity Recognition (NER)"""
    
    def __init__(self, vocab_size, embed_dim, hidden_dim,
                 output_dim, n_layers, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            batch_first=True,
            bidirectional=True,    # Bidirectional!
            dropout=dropout if n_layers > 1 else 0
        )
        
        # Hidden dim × 2 because the LSTM is bidirectional
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, text, text_lengths):
        embedded = self.dropout(self.embedding(text))
        
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, text_lengths.cpu(),
            batch_first=True, enforce_sorted=False
        )
        
        packed_output, (hidden, cell) = self.lstm(packed)
        output, _ = nn.utils.rnn.pad_packed_sequence(
            packed_output, batch_first=True
        )
        
        # Apply FC to every timestep (for sequence labeling)
        predictions = self.fc(self.dropout(output))
        # predictions: (batch, seq_len, output_dim)
        
        return predictions

# === INITIALIZATION ===
model = BiLSTMTagger(
    vocab_size=5000,
    embed_dim=128,
    hidden_dim=256,
    output_dim=9,      # 9 NER tags (B-PER, I-PER, B-LOC, etc.)
    n_layers=2,
    dropout=0.3
)

print(f"Model:\n{model}")
print(f"\nTotal Parameters: {sum(p.numel() for p in model.parameters()):,}")
print("Hidden dimension: 256 × 2 (bidirectional) = 512")
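
For sequence labeling, the padded positions should not contribute to the loss. One common way to handle this is to mark padded labels with a special value and pass it as `ignore_index`. The sketch below assumes -100 as that value (the default of `nn.CrossEntropyLoss`) and uses random tensors as a stand-in for the tagger's output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD_TAG = -100                                   # label for padded positions (an assumption)
batch, seq_len, n_tags = 2, 4, 9

predictions = torch.randn(batch, seq_len, n_tags)     # stand-in for the tagger's output
tags = torch.tensor([[1, 0, 3, 2],
                     [4, 0, PAD_TAG, PAD_TAG]])       # second sentence has only 2 tokens

criterion = nn.CrossEntropyLoss(ignore_index=PAD_TAG)

# CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten batch and time
loss = criterion(predictions.view(-1, n_tags), tags.view(-1))

# Same value computed by hand over the 6 real tokens only
mask = tags.view(-1) != PAD_TAG
manual = nn.functional.cross_entropy(predictions.view(-1, n_tags)[mask], tags.view(-1)[mask])
print(loss.item(), manual.item())
```

The output of `pad_packed_sequence` is only as long as the longest sequence in the batch, so the label tensor needs to be padded to that same length.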

9. Comprehension Quiz

🎯 Article Summary

An RNN keeps a "memory" in a hidden state that flows from timestep to timestep
Vanishing gradients are the main weakness of vanilla RNNs: gradients shrink as they are backpropagated through many timesteps
LSTM mitigates this with a cell state and 3 gates (forget, input, output)
GRU is a lighter alternative to LSTM with 2 gates (update, reset)
Bidirectional RNNs use context from both directions, which is useful in NLP
Gradient clipping is used to guard against exploding gradients
Today, Transformers (BERT, GPT) have largely replaced RNNs, but RNNs remain relevant for time series and edge devices