1. Introduction to CNNs
A Convolutional Neural Network (CNN) is a type of neural network designed for grid-structured data, most notably images (2D grids of pixels). CNNs revolutionized computer vision and remain the foundation of many modern image recognition systems.
Why a CNN Instead of an Ordinary Neural Network?
Suppose we feed a 224×224 RGB image (3 channels) into an ordinary fully connected (dense) network. The input layer alone has 224 × 224 × 3 = 150,528 neurons. If the first hidden layer has 1,000 neurons, that single layer already needs 150,528 × 1,000 ≈ 150 million weights. That is impractical.
FULLY CONNECTED (FC) NETWORK:            CNN (CONVOLUTION):
Input: 224×224×3 = 150,528 neurons       Input: 224×224×3
┌───────────────────────────┐            ┌───────────────────────────┐
│ 150,528 ──► 1,000         │            │ Conv 3×3 filter           │
│ Parameters: ~150 million  │            │ Parameters: only 27       │
│ (far too many!)           │            │ per filter (3×3×3)!       │
└───────────────────────────┘            └───────────────────────────┘
Problems with FC:                        Advantages of CNN:
• Too many parameters                    • Parameter sharing
• Severe overfitting                     • Local connectivity
• Cannot exploit spatial                 • Translation equivariance (and
  information                              near-invariance after pooling)
• Input must have a fixed size           • Very efficient

CNNs take inspiration from how biological vision works:
• Local neurons → each one "sees" only a small area (receptive field)
• The same feature is useful anywhere in the image → parameter sharing!
• Hierarchical features: edge → texture → parts → object
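The parameter counts in the comparison above can be checked with a few lines of arithmetic. This is a minimal sketch in plain Python; the helper names `dense_params` and `conv2d_params` are made up for this illustration and are not part of any library.

```python
def dense_params(n_in, n_out, bias=True):
    """Weights (plus optional biases) of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(c_in, c_out, k, bias=True):
    """Weights (plus optional biases) of a k×k convolution layer.

    The count does not depend on the image height or width,
    because the same filter is reused at every position.
    """
    return c_in * c_out * k * k + (c_out if bias else 0)


h, w, c = 224, 224, 3

fc = dense_params(h * w * c, 1000)                # 150,528 inputs -> 1,000 neurons
one_filter = conv2d_params(c, 1, 3, bias=False)   # a single 3×3 filter over RGB
conv_32 = conv2d_params(c, 32, 3)                 # 32 such filters, with biases

print(f"Dense 150,528 -> 1,000 : {fc:,} parameters")
print(f"One 3×3 RGB filter     : {one_filter} parameters")
print(f"Conv layer, 32 filters : {conv_32:,} parameters")
```

Even a convolution layer with 32 filters stays below a thousand parameters, while the dense layer needs about 150 million.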
Applications of CNNs
| Application | Example | Why a CNN? |
|---|---|---|
| Image Classification | Labeling photos (cat/dog/car) | Detects hierarchical visual patterns |
| Object Detection | YOLO, Faster R-CNN | Detects and localizes objects in an image |
| Semantic Segmentation | U-Net (medical imaging) | Classifies every pixel |
| Face Recognition | FaceNet, DeepFace | Extracts facial features |
| Self-driving Cars | Tesla, Waymo | Detects signs, pedestrians, lanes |
| Medical Imaging | Tumor and cancer detection | Finds abnormal patterns in X-ray/MRI |
| Text Classification | Sentiment analysis | 1D CNNs over text |
| Video Analysis | Action recognition | 3D CNNs over video frames |
2. Convolution Layer
The convolution layer is the main building block of a CNN. It uses filters (also called kernels; their entries are the learned weights) that slide across the input image to detect particular features.
How Convolution Works
INPUT (5×5):              FILTER/Kernel (3×3):
┌───┬───┬───┬───┬───┐     ┌────┬────┬────┐
│ 1 │ 2 │ 0 │ 1 │ 3 │     │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤     ├────┼────┼────┤
│ 0 │ 1 │ 2 │ 3 │ 1 │     │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤     ├────┼────┼────┤
│ 2 │ 3 │ 1 │ 0 │ 2 │     │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤     └────┴────┴────┘
│ 1 │ 0 │ 3 │ 2 │ 1 │
├───┼───┼───┼───┼───┤     ↑ Vertical edge detector!
│ 3 │ 2 │ 1 │ 0 │ 3 │     (detects vertical edges)
└───┴───┴───┴───┴───┘

CONVOLUTION OPERATION (top-left position):
┌───┬───┬───┐     ┌────┬────┬────┐
│ 1 │ 2 │ 0 │  ×  │  1 │  0 │ -1 │
├───┼───┼───┤     ├────┼────┼────┤
│ 0 │ 1 │ 2 │  ×  │  1 │  0 │ -1 │
├───┼───┼───┤     ├────┼────┼────┤
│ 2 │ 3 │ 1 │  ×  │  1 │  0 │ -1 │
└───┴───┴───┘     └────┴────┴────┘
= (1×1 + 2×0 + 0×(-1)) + (0×1 + 1×0 + 2×(-1)) + (2×1 + 3×0 + 1×(-1))
= (1 + 0 + 0) + (0 + 0 - 2) + (2 + 0 - 1)
= 1 + (-2) + 1 = 0
→ Slide the filter one step to the right → compute again
→ Repeat until every position is covered

Output FEATURE MAP (3×3):
┌──────┬──────┬──────┐
│   0  │   2  │  -3  │
├──────┼──────┼──────┤
│  -3  │  -1  │   2  │  ← Large magnitudes mark strong vertical edges
├──────┼──────┼──────┤
│   1  │   3  │  -1  │
└──────┴──────┴──────┘
Note: as in deep learning frameworks, the kernel is applied without flipping (strictly speaking, cross-correlation). The vertical edge kernel in the filter table and the code uses the opposite sign, [-1, 0, 1], so its output is the negative of this map.
Types of Filters/Kernels
A CNN learns automatically which filters are most useful. For intuition, though, here are some well-known hand-crafted filters:
| Filter | Kernel | Purpose |
|---|---|---|
| Vertical Edge | [[-1,0,1],[-1,0,1],[-1,0,1]] | Detects vertical edges |
| Horizontal Edge | [[-1,-1,-1],[0,0,0],[1,1,1]] | Detects horizontal edges |
| Sharpen | [[0,-1,0],[-1,5,-1],[0,-1,0]] | Sharpens the image |
| Blur (Gaussian) | [[1,2,1],[2,4,2],[1,2,1]]/16 | Blurs the image (reduces noise) |
| Emboss | [[-2,-1,0],[-1,1,1],[0,1,2]] | Gives a raised, 3D-like effect |
import numpy as np


def conv2d_manual(image, kernel):
    """Manual 2D convolution (no padding, stride 1)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = h - kh + 1
    out_w = w - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the patch of the image under the kernel
            patch = image[i:i+kh, j:j+kw]
            # Element-wise multiplication + sum
            output[i, j] = np.sum(patch * kernel)
    return output


# Example input
image = np.array([
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [2, 3, 1, 0, 2],
    [1, 0, 3, 2, 1],
    [3, 2, 1, 0, 3]
], dtype=float)

# Vertical edge detector
kernel_vert = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
], dtype=float)

# Horizontal edge detector
kernel_horiz = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1]
], dtype=float)

print("Input Image:")
print(image)
print("\nVertical Edge Kernel:")
print(kernel_vert)

output_vert = conv2d_manual(image, kernel_vert)
output_horiz = conv2d_manual(image, kernel_horiz)
print("\nOutput (Vertical Edge):")
print(output_vert)
print("\nOutput (Horizontal Edge):")
print(output_horiz)

# Output dimensions
h, w = image.shape
kh, kw = kernel_vert.shape
print(f"\nInput size: {h}×{w}")
print(f"Kernel size: {kh}×{kw}")
print(f"Output size: ({h}-{kh}+1)×({w}-{kw}+1) = {h-kh+1}×{w-kw+1}")
3. Padding, Stride & Output Size
Padding
Padding adds zeros (or other values) around the input before the convolution is applied. It matters because without padding every convolution shrinks the output, and pixels near the border are covered by fewer filter positions, so information at the edges of the image is lost faster.
ORIGINAL INPUT (3×3):        PADDED INPUT (5×5):
┌───┬───┬───┐                ┌───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │                │ 0 │ 0 │ 0 │ 0 │ 0 │
├───┼───┼───┤                ├───┼───┼───┼───┼───┤
│ 4 │ 5 │ 6 │     ───►       │ 0 │ 1 │ 2 │ 3 │ 0 │
├───┼───┼───┤                ├───┼───┼───┼───┼───┤
│ 7 │ 8 │ 9 │                │ 0 │ 4 │ 5 │ 6 │ 0 │
└───┴───┴───┘                ├───┼───┼───┼───┼───┤
                             │ 0 │ 7 │ 8 │ 9 │ 0 │
                             ├───┼───┼───┼───┼───┤
                             │ 0 │ 0 │ 0 │ 0 │ 0 │
                             └───┴───┴───┴───┴───┘
Stride
Stride sets how many pixels the filter moves at each step. Stride 1 moves one pixel at a time; stride 2 moves two pixels at a time, which roughly halves the output size.
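To make padding and stride concrete, here is a minimal sketch in plain Python (nested lists, no NumPy) that applies a filter with zero padding and an arbitrary stride. The function name `conv2d` and the all-ones "sum" kernel are choices made for this illustration only; the 3×3 input is the one from the padding diagram.

```python
def conv2d(image, kernel, padding=0, stride=1):
    """Slide `kernel` over the zero-padded `image` (both 2D lists)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])

    # Zero padding: build a larger grid and copy the image into its centre
    ph, pw = h + 2 * padding, w + 2 * padding
    padded = [[0] * pw for _ in range(ph)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    out_h = (ph - kh) // stride + 1
    out_w = (pw - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride   # stride controls the jump
            row.append(sum(padded[top + a][left + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # sums each 3×3 neighbourhood

valid = conv2d(image, box)                          # no padding: 3×3 -> 1×1
same = conv2d(image, box, padding=1)                # padding 1:  3×3 -> 3×3
strided = conv2d(image, box, padding=1, stride=2)   # stride 2:   3×3 -> 2×2

print(valid)
print(same)
print(strided)
```

With padding 1 the output keeps the input size, and with stride 2 the filter visits only every second position, so the strided result contains the four corner values of the "same" result.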
Output Size Formula
Output Size = floor((Input Size - Kernel Size + 2 × Padding) / Stride) + 1
Example: Input 32×32, Kernel 5×5, Padding 2, Stride 1:
Output = (32 - 5 + 2×2) / 1 + 1 = 32 → the output is 32×32, the same as the input. This is called "same" padding.
| Configuration | Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|---|
| Valid (no padding) | 32×32 | 3×3 | 0 | 1 | 30×30 |
| Same padding | 32×32 | 3×3 | 1 | 1 | 32×32 |
| Downsample 2× | 32×32 | 3×3 | 1 | 2 | 16×16 |
| Large kernel | 28×28 | 5×5 | 0 | 1 | 24×24 |
| Same large kernel | 28×28 | 5×5 | 2 | 1 | 28×28 |
def calc_output_size(input_size, kernel_size, padding=0, stride=1):
    """Compute the output size of a convolution (or pooling) layer."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


# Try several configurations
configs = [
    (32, 3, 0, 1, "Valid, stride=1"),
    (32, 3, 1, 1, "Same, stride=1"),
    (32, 3, 1, 2, "Downsample 2x"),
    (28, 5, 0, 1, "5x5 kernel, no pad"),
    (28, 5, 2, 1, "5x5 kernel, same pad"),
    (224, 7, 3, 2, "ResNet first layer"),
]
print("Input  Kernel  Pad  Stride  Output   Description")
print("-" * 65)
for inp, ker, pad, stride, desc in configs:
    out = calc_output_size(inp, ker, pad, stride)
    print(f" {inp:3d}   {ker:2d}x{ker:<2d}  {pad:2d}   {stride:2d}     {out:3d}x{out:<3d}  {desc}")

# Multi-layer CNN output size tracker
print("\n=== Simple CNN Architecture ===")
# (name, expected input size, kernel, padding, stride)
layers = [
    ("Conv1", 32, 3, 1, 1),
    ("Conv2", 32, 3, 1, 1),
    ("Pool1", 32, 2, 0, 2),  # MaxPool 2x2
    ("Conv3", 16, 3, 1, 1),
    ("Conv4", 16, 3, 1, 1),
    ("Pool2", 16, 2, 0, 2),  # MaxPool 2x2
]
size = 32          # input size; matches the expected sizes listed above
num_channels = 16  # assumed channel count of the last conv layer
print(f"Input: {size}x{size}")
for name, expected_in, ker, pad, stride in layers:
    assert size == expected_in, f"{name} expects a {expected_in}x{expected_in} input"
    size = calc_output_size(size, ker, pad, stride)
    print(f"  {name}: kernel={ker}x{ker}, pad={pad}, stride={stride} → {size}x{size}")
print(f"Flatten output: {size}×{size}×{num_channels} = {size*size*num_channels}")
4. Pooling Layer
A pooling layer reduces the spatial size (width × height) of the feature maps while making the features more robust to small translations of the input. Pooling also cuts computation and can lower the risk of overfitting.
Types of Pooling
MAX POOLING 2×2 (stride=2):          AVG POOLING 2×2 (stride=2):
Input 4×4:                           Input 4×4:
┌───┬───┬───┬───┐                    ┌───┬───┬───┬───┐
│ 1 │ 3 │ 2 │ 1 │                    │ 1 │ 3 │ 2 │ 1 │
├───┼───┼───┼───┤                    ├───┼───┼───┼───┤
│ 5 │ 6 │ 1 │ 0 │                    │ 5 │ 6 │ 1 │ 0 │
├───┼───┼───┼───┤                    ├───┼───┼───┼───┤
│ 2 │ 4 │ 8 │ 7 │                    │ 2 │ 4 │ 8 │ 7 │
├───┼───┼───┼───┤                    ├───┼───┼───┼───┤
│ 1 │ 3 │ 2 │ 5 │                    │ 1 │ 3 │ 2 │ 5 │
└───┴───┴───┴───┘                    └───┴───┴───┴───┘
Output 2×2:                          Output 2×2:
┌───┬───┐                            ┌──────┬──────┐
│ 6 │ 2 │  ← max of                  │ 3.75 │ 1.0  │  ← mean of
├───┼───┤    each 2×2                ├──────┼──────┤    each 2×2
│ 4 │ 8 │    region                  │ 2.5  │ 5.5  │    region
└───┴───┘                            └──────┴──────┘
Max Pooling: take the MAXIMUM        Avg Pooling: take the AVERAGE
→ Most commonly used                 → Sometimes used at the
→ Keeps the most prominent             end of the network
  features (edge, texture)
| Pooling Type | Operation | Strength | Typical Use |
|---|---|---|---|
| Max Pooling | Takes the maximum of each region | Keeps the strongest features | Most common (default) |
| Average Pooling | Takes the mean of each region | Smooth; every value contributes | Global Average Pooling (end of the network) |
| Global Average Pooling | Averages the ENTIRE feature map → 1 value per channel | Greatly reduces parameters | End of the CNN, before the classifier |
| Stochastic Pooling | Random sampling according to the activation distribution | Regularization | Rarely used |
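The pooling variants in the table can be sketched in a few lines of plain Python. The helper names `pool2d` and `mean` are invented for this illustration; the input is the 4×4 feature map from the diagram above.

```python
def pool2d(x, size=2, stride=2, reduce=max):
    """Apply `reduce` to every size×size window of the 2D list `x`."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [x[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


feature_map = [[1, 3, 2, 1],
               [5, 6, 1, 0],
               [2, 4, 8, 7],
               [1, 3, 2, 5]]

max_pooled = pool2d(feature_map, reduce=max)    # strongest response per window
avg_pooled = pool2d(feature_map, reduce=mean)   # average response per window
# Global average pooling: the whole map collapses to one number per channel
global_avg = mean([v for row in feature_map for v in row])

print(max_pooled)
print(avg_pooled)
print(global_avg)
```

Only the reduction function differs between max and average pooling; global average pooling is the special case where the window covers the entire feature map.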
5. A Complete CNN Architecture
A complete CNN architecture consists of several components working together:
General Structure of a CNN
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌─────────┐
│  INPUT  │──►│  CONV   │──►│  POOL   │──►│ FLATTEN │──►│  DENSE   │──►│ OUTPUT  │
│  IMAGE  │   │ LAYERS  │   │ LAYERS  │   │         │   │  LAYERS  │   │ SOFTMAX │
│ 28×28×1 │   │         │   │         │   │         │   │          │   │         │
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └──────────┘   └─────────┘
Conv + Pool Blocks (Feature Extractor)     |  Dense Layers (Classifier)
Stage 1: Conv → ReLU → Pool                |  FC Layer 1: 128 neurons + ReLU
  28×28 → Conv 3×3 (32 filters) → Pool     |  FC Layer 2: 64 neurons + ReLU
  → Output: 14×14×32                       |  Output: 10 neurons (softmax)
Stage 2: Conv → ReLU → Pool
  14×14 → Conv 3×3 (64 filters) → Pool
  → Output: 7×7×64
Stage 3: Conv → ReLU
  7×7 → Conv 3×3 (128 filters)
  → Output: 7×7×128
Global Average Pooling
  7×7×128 → 1×1×128 → Flatten → 128
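The shapes in the diagram can be traced, and the parameters counted, with a short script. This is a sketch under stated assumptions: every 3×3 convolution uses padding 1 and stride 1 (which the 28 → 14 → 7 sizes imply), every pooling step is 2×2 with stride 2, and all layers have biases. The helper names are hypothetical.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    return (size - kernel + 2 * padding) // stride + 1


def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out      # weights + biases


def dense_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + biases


size, channels = 28, 1                       # 28×28×1 input image
total = 0
shapes = []

# (number of filters, followed by 2×2 pooling?) for stages 1 to 3
for c_out, pool in [(32, True), (64, True), (128, False)]:
    total += conv_params(channels, c_out)
    size = conv_out(size)                    # 3×3 conv, padding 1: size unchanged
    if pool:
        size //= 2                           # 2×2 pooling halves the size
    channels = c_out
    shapes.append((size, size, channels))

features = channels                          # global average pooling: 1 value per channel
for n_out in (128, 64, 10):                  # FC 1, FC 2, output layer
    total += dense_params(features, n_out)
    features = n_out

print("Stage outputs:", shapes)
print(f"Total parameters: {total:,}")
```

Because global average pooling reduces 7×7×128 to 128 values, the first dense layer needs only 128 × 128 weights instead of 6,272 × 128.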
Activation: ReLU
ReLU (Rectified Linear Unit) is the most commonly used activation function in CNNs. Its formula is f(x) = max(0, x): negative values become 0 and positive values pass through unchanged.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)


# Activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x * sigmoid(x)


activations = {
    'Sigmoid': sigmoid(x),
    'Tanh': tanh(x),
    'ReLU': relu(x),
    'Leaky ReLU (α=0.01)': leaky_relu(x),
    'Swish (SiLU)': swish(x)
}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, (name, values) in zip(axes.ravel(), activations.items()):
    ax.plot(x, values, 'b-', linewidth=2)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_title(name, fontsize=12)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-2, 5)
# Hide the unused sixth subplot
axes[1, 2].axis('off')
plt.suptitle('Neural Network Activation Functions', fontsize=14)
plt.tight_layout()
plt.show()


# ReLU vs Sigmoid: gradient comparison
def sigmoid_grad(z):
    """Derivative of the sigmoid; takes a scalar so the result can be formatted."""
    s = sigmoid(z)
    return s * (1 - s)

print("=== Gradient Comparison ===")
print(f"Sigmoid gradient max (x=0): {sigmoid_grad(0.0):.4f}")
print(f"Saturated sigmoid (x=5): {sigmoid_grad(5.0):.6f}")
print(f"ReLU gradient (x=5): {1.0:.4f}")
print(f"ReLU gradient (x=-5): {0.0:.4f}")
print("\n→ Sigmoid suffers from vanishing gradients once it saturates!")
print("→ ReLU: the gradient is 1 for x > 0 and 0 for x < 0, which mitigates vanishing gradients")
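Why the size of a single gradient matters becomes clearer once several layers are chained, because backpropagation multiplies the local gradients together. The sketch below is a deliberately simplified illustration in plain Python: it ignores the weight matrices and assumes every layer sees the same pre-activation value, and the function `chained_gradient` is invented for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, reached at z = 0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, depth):
    """Product of `depth` identical local gradients (weights are ignored)."""
    return grad_fn(z) ** depth


depth = 10
best_case_sigmoid = chained_gradient(sigmoid_grad, 0.0, depth)   # 0.25 ** 10
active_relu = chained_gradient(relu_grad, 2.0, depth)            # 1.0 ** 10
dead_relu = chained_gradient(relu_grad, -2.0, depth)             # 0.0 ** 10

print(f"Sigmoid, best case, {depth} layers: {best_case_sigmoid:.2e}")
print(f"ReLU, active units, {depth} layers: {active_relu:.1f}")
print(f"ReLU, inactive units, {depth} layers: {dead_relu:.1f}")
```

Even in the sigmoid's best case the signal shrinks by a factor of four per layer, while active ReLU units pass it through unchanged. The last line shows the flip side: a ReLU unit that is never active passes no gradient at all, which is the motivation for variants such as Leaky ReLU.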
6. Popular Architectures
Evolution of CNN Architectures
Timeline of CNN Architectures:
1998: LeNet-5 (Yann LeCun)
│ • 5 learnable layers, about 60K parameters
│ • Handwritten digit recognition (MNIST)
│
2012: AlexNet (Krizhevsky et al.)
│ • 8 layers, 60M parameters
│ • ImageNet winner → Deep learning revolution!
│ • ReLU, Dropout, GPU training
│
2014: VGGNet (Simonyan & Zisserman)
│ • 16-19 layers, 138M parameters
│ • Consistent 3×3 kernels, very deep
│
2014: GoogLeNet/Inception (Szegedy et al.)
│ • 22 layers, 6.8M parameters
│ • Inception module (parallel convolutions)
│ • 1×1 convolutions to reduce parameters
│
2015: ResNet (He et al.) ← BREAKTHROUGH
│ • Up to 152 layers (ResNet-50: 25.6M parameters; ResNet-152: about 60M)
│ • Residual connections (skip connections)
│ • Greatly eased the vanishing-gradient and degradation problems in very deep networks
│ • A ResNet ensemble beat the commonly cited human top-5 error estimate on ImageNet
│
2017: DenseNet (Huang et al.)
│ • Dense connections (each layer feeds every later layer in its block)
│ • Feature reuse → parameter efficient
│
2019: EfficientNet (Tan & Le)
│ • Compound scaling (width, depth, resolution)
│ • State-of-the-art accuracy at the time with far fewer parameters
│
2020+: Vision Transformer (ViT)
  • Replaces convolutions with a Transformer
  • Self-attention over image patches
Architecture Comparison
| Architecture | Year | Layers | Parameters | Top-5 Acc (ImageNet) | Key Innovation |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | 60K | n/a | First widely successful CNN |
| AlexNet | 2012 | 8 | 60M | 84.7% | ReLU, Dropout, GPU |
| VGG-16 | 2014 | 16 | 138M | 92.7% | 3×3 kernels, simplicity |
| GoogLeNet | 2014 | 22 | 6.8M | 93.3% | Inception module |
| ResNet-50 | 2015 | 50 | 25.6M | 96.4% (ResNet ensemble result) | Skip connections |
| EfficientNet-B0 | 2019 | n/a | 5.3M | 93.3% (97.1% is the much larger B7) | Compound scaling |
These accuracy figures come from different evaluation settings (single model vs. ensemble, number of crops), so they are not strictly comparable.
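Two of the design ideas in the table can be checked with simple parameter arithmetic: stacking 3×3 convolutions (VGG) and reducing channels with a 1×1 convolution before an expensive filter (Inception). The sketch below is plain Python; the helper `conv_params` is hypothetical, and the channel counts are illustrative assumptions, not a reproduction of any published layer configuration.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k×k convolution layer (biases omitted for clarity)."""
    return c_in * c_out * k * k


# VGG idea: three stacked 3×3 convs cover the same 7×7 receptive field
# as a single 7×7 conv, but with fewer weights.
C = 64                                        # assumed channel count
one_7x7 = conv_params(C, C, 7)
three_3x3 = 3 * conv_params(C, C, 3)

# Inception idea: a 1×1 conv shrinks the channels before a costly 5×5 conv.
c_in, c_mid, c_out = 192, 16, 32              # illustrative sizes
direct_5x5 = conv_params(c_in, c_out, 5)
reduced_5x5 = conv_params(c_in, c_mid, 1) + conv_params(c_mid, c_out, 5)

print(f"One 7×7 conv          : {one_7x7:,}")
print(f"Three 3×3 convs       : {three_3x3:,}")
print(f"Direct 5×5 conv       : {direct_5x5:,}")
print(f"1×1 reduction, then 5×5: {reduced_5x5:,}")
```

The stacked 3×3 version needs 27/49 of the weights of the 7×7 layer and adds two extra nonlinearities, while the 1×1 bottleneck cuts the 5×5 branch by roughly a factor of ten in this example.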
7. Implementing a CNN with PyTorch
Now let's implement a complete CNN for image classification with PyTorch on the CIFAR-10 dataset (10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
# =============================================
# 1. DEVICE & HYPERPARAMETERS
# =============================================
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
BATCH_SIZE = 64
LEARNING_RATE = 0.001
NUM_EPOCHS = 20
NUM_CLASSES = 10
# =============================================
# 2. DATA LOADING & AUGMENTATION
# =============================================
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Download CIFAR-10
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform_train
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test
)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         shuffle=False, num_workers=2)
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
# =============================================
# 3. MODEL DEFINITION
# =============================================
class CNNCIFAR10(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: Conv → BN → ReLU → Conv → BN → ReLU → MaxPool
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 32×32 → 16×16
            nn.Dropout2d(0.25)
        )
        # Block 2
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 16×16 → 8×8
            nn.Dropout2d(0.25)
        )
        # Block 3
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),  # 8×8 → 4×4
            nn.Dropout2d(0.25)
        )
        # Global Average Pooling + Classifier
        self.global_pool = nn.AdaptiveAvgPool2d(1)  # 4×4 → 1×1
        self.classifier = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.block1(x)         # → (batch, 32, 16, 16)
        x = self.block2(x)         # → (batch, 64, 8, 8)
        x = self.block3(x)         # → (batch, 128, 4, 4)
        x = self.global_pool(x)    # → (batch, 128, 1, 1)
        x = x.view(x.size(0), -1)  # → (batch, 128)
        x = self.classifier(x)
        return x


model = CNNCIFAR10(NUM_CLASSES).to(device)

# Print model summary
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nModel Parameters: {total_params:,} total, {trainable_params:,} trainable")
# =============================================
# 4. TRAINING LOOP
# =============================================
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)
train_losses = []
train_accs = []
test_accs = []

for epoch in range(NUM_EPOCHS):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    for batch_idx, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    scheduler.step()  # one scheduler step per epoch
    train_loss = running_loss / len(train_loader)
    train_acc = 100. * correct / total
    train_losses.append(train_loss)
    train_accs.append(train_acc)

    # Test evaluation
    model.eval()
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            test_total += labels.size(0)
            test_correct += predicted.eq(labels).sum().item()
    test_acc = 100. * test_correct / test_total
    test_accs.append(test_acc)
    print(f"Epoch [{epoch+1:2d}/{NUM_EPOCHS}] "
          f"Loss: {train_loss:.4f} | "
          f"Train Acc: {train_acc:.2f}% | "
          f"Test Acc: {test_acc:.2f}%")
# =============================================
# 5. VISUALIZATION
# =============================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Loss curve
axes[0].plot(train_losses, 'b-', linewidth=2)
axes[0].set_title('Training Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(train_accs, 'b-', linewidth=2, label='Train')
axes[1].plot(test_accs, 'r-', linewidth=2, label='Test')
axes[1].set_title('Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Sample predictions on the first 8 test images
model.eval()
images, labels = next(iter(test_loader))
with torch.no_grad():
    outputs = model(images[:8].to(device))
_, predicted = outputs.max(1)
mean = torch.tensor([0.4914, 0.4822, 0.4465])
std = torch.tensor([0.2470, 0.2435, 0.2616])
# Undo the normalization and tile the 8 images side by side in the third panel
tiles = [(images[i].permute(1, 2, 0) * std + mean).clamp(0, 1) for i in range(8)]
axes[2].imshow(torch.cat(tiles, dim=1).numpy())
axes[2].set_title('Pred: ' + ', '.join(classes[p] for p in predicted.cpu().tolist()),
                  fontsize=8)
axes[2].axis('off')

plt.suptitle(f'CNN CIFAR-10 | Final Test Acc: {test_accs[-1]:.2f}%', fontsize=14)
plt.tight_layout()
plt.show()
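The training loop passes raw model outputs (logits) to `nn.CrossEntropyLoss`, which applies log-softmax internally and then takes the negative log-probability of the correct class. The following sketch reproduces that computation for a single sample in plain Python, to show what the loss value means. The function names and the example logits are made up for illustration; this is not PyTorch's implementation.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, computed in a numerically stable way."""
    m = max(logits)                              # subtract the max before exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(logits)[target]


def predict(logits):
    """Index of the largest logit, like outputs.max(1) for one sample."""
    return max(range(len(logits)), key=lambda k: logits[k])


logits = [2.0, 0.5, -1.0]                        # made-up scores for 3 classes
loss_correct = cross_entropy(logits, target=0)   # true class has the top score
loss_wrong = cross_entropy(logits, target=2)     # true class has the lowest score
uniform_loss = cross_entropy([0.0] * 10, target=3)

print(f"Predicted class: {predict(logits)}")
print(f"Loss when class 0 is correct: {loss_correct:.4f}")
print(f"Loss when class 2 is correct: {loss_wrong:.4f}")
print(f"Loss of a uniform 10-class guess: {uniform_loss:.4f}")
```

A model that guesses uniformly over 10 classes has a loss of ln(10) ≈ 2.30, which is a useful reference point for the first epoch on CIFAR-10: a training loss near that value means the network has not learned anything yet.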
8. Transfer Learning
Transfer learning reuses a model that has already been trained on a large dataset (such as ImageNet) and adapts it to a new task. It is very effective because the low-level features (edges, textures) learned by the pre-trained model are generally useful across vision tasks.
Transfer Learning Strategies
PRE-TRAINED MODEL (ImageNet):
┌──────────────────────────────────────────────────────┐
│ Feature Extractor (Conv layers)                      │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐      │
│ │ Edge    │ │ Texture │ │ Pattern │ │ Object  │      │
│ │ Detector│ │ Detector│ │ Detector│ │ Parts   │      │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘      │
├──────────────────────────────────────────────────────┤
│ Classifier Head (FC layers)                          │
│ [1000 classes: ImageNet]                             │
└──────────────────────────────────────────────────────┘

STRATEGY 1: Feature Extraction (freeze the conv layers, train the classifier)
┌──────────────────────────────────────────────────────┐
│ ❄️ Frozen Feature Extractor (not trained)            │
├──────────────────────────────────────────────────────┤
│ 🔥 New Classifier Head [10 classes]                  │
│    (trained from scratch)                            │
└──────────────────────────────────────────────────────┘
→ Suited to: small datasets, tasks similar to ImageNet

STRATEGY 2: Fine-Tuning (unfreeze some conv layers, train them together with the new head)
┌──────────────────────────────────────────────────────┐
│ ❄️ Frozen Early Layers (edge, texture)               │
├──────────────────────────────────────────────────────┤
│ 🔥 Unfrozen Later Layers (object parts)              │
├──────────────────────────────────────────────────────┤
│ 🔥 New Classifier Head [10 classes]                  │
└──────────────────────────────────────────────────────┘
→ Suited to: medium-sized datasets, tasks that differ from ImageNet
import torch
import torch.nn as nn
import torchvision.models as models

# =============================================
# TRANSFER LEARNING: Pre-trained ResNet-18
# =============================================
# Load pre-trained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
print("=== Original ResNet-18 ===")
print(f"Total params: {sum(p.numel() for p in model.parameters()):,}")

# =============================================
# STRATEGY 1: Feature Extraction
# =============================================
# Freeze ALL pre-trained parameters
for param in model.parameters():
    param.requires_grad = False

# Replace classifier (final FC layer)
# ResNet-18 final layer: model.fc (512 → 1000)
# Newly created layers are trainable by default
model.fc = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)  # 10 classes for CIFAR-10
)
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print("\n=== After Feature Extraction ===")
print(f"Total params: {total_params:,}")
print(f"Trainable params: {trainable_params:,}")
print(f"Frozen params: {total_params - trainable_params:,}")
# =============================================
# STRATEGY 2: Fine-Tuning
# =============================================
model_ft = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the stem (conv1, bn1) and the early stages (layer1, layer2).
# layer3 and layer4 stay trainable (requires_grad is True by default).
frozen_prefixes = ('conv1.', 'bn1.', 'layer1.', 'layer2.')
for name, param in model_ft.named_parameters():
    if name.startswith(frozen_prefixes):
        param.requires_grad = False

# Replace FC (new layers are trainable by default)
model_ft.fc = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)
)
trainable_ft = sum(p.numel() for p in model_ft.parameters() if p.requires_grad)
print("\n=== After Fine-Tuning ===")
print(f"Trainable params: {trainable_ft:,}")
print("Frozen early layers: conv1, bn1, layer1, layer2")
print("Unfrozen: layer3, layer4, fc")

# =============================================
# DISCRIMINATIVE LEARNING RATES
# =============================================
# Give each trainable layer group its own learning rate.
# Frozen layers receive no gradient, so they are left out of the optimizer.
param_groups = [
    {'params': [p for n, p in model_ft.named_parameters()
                if n.startswith(('layer3.', 'layer4.'))],
     'lr': 1e-4},  # Pre-trained later layers: small LR
    {'params': model_ft.fc.parameters(),
     'lr': 1e-3}   # New layers: larger LR
]
optimizer_ft = torch.optim.Adam(param_groups, weight_decay=1e-4)
print("\n=== Discriminative Learning Rates ===")
print("  Early layers (frozen): not optimized")
print("  Later layers (layer3, layer4): 1e-4")
print("  FC layer: 1e-3")
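Selecting layers by parameter name is easy to get subtly wrong, so here is the grouping logic on its own, without PyTorch. This is a sketch under assumptions: the names in the list imitate the style of torchvision's ResNet-18 parameter names after the `fc` head has been replaced, and the `GROUPS` table and `lr_for` function are hypothetical helpers for this example.

```python
# A handful of parameter names in the style of torchvision's ResNet-18
names = [
    "conv1.weight", "bn1.weight",
    "layer1.0.conv1.weight", "layer2.0.conv1.weight",
    "layer3.0.conv1.weight", "layer4.1.bn2.bias",
    "fc.0.weight", "fc.3.bias",
]

# (name prefixes, learning rate); None means "frozen, not optimized"
GROUPS = [
    (("conv1.", "bn1.", "layer1.", "layer2."), None),
    (("layer3.", "layer4."), 1e-4),
    (("fc.",), 1e-3),
]


def lr_for(name):
    """Return the learning rate of the group whose prefix matches `name`."""
    for prefixes, lr in GROUPS:
        if name.startswith(prefixes):       # str.startswith accepts a tuple
            return lr
    raise ValueError(f"no group for parameter {name!r}")


plan = {name: lr_for(name) for name in names}

# Pitfall: a substring test also matches conv1 layers inside residual blocks
substring_hits = [n for n in names if "conv1" in n]
prefix_hits = [n for n in names if n.startswith("conv1.")]

for name, lr in plan.items():
    print(f"{name:24s} {'frozen' if lr is None else lr}")
print("substring match:", substring_hits)
print("prefix match   :", prefix_hits)
```

Matching on prefixes keeps the stem and the residual blocks apart, and raising an error for unmatched names makes it obvious when a parameter would otherwise be silently left out of every optimizer group.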
9. Practical Tips & Tricks
- Data augmentation is a must: random flip, rotation, crop, color jitter. It usually improves generalization noticeably
- Use Batch Normalization: place it after each conv layer. It speeds up training and has a mild regularizing effect
- Start with transfer learning: don't build from scratch unless the dataset is very unusual (for example, medical imaging)
- Learning rate finder: start from a small LR and raise it until the loss blows up, then pick an LR about 10× smaller than the one where it blew up
- Cosine annealing: an LR scheduler that lowers the LR along a cosine curve. It often works better than step decay
- Global Average Pooling: replace the FC layers at the end with GAP. This cuts the parameter count drastically
- Monitor overfitting: if training accuracy keeps rising while test accuracy stalls, you have too little data or the model is too complex
- Mixed precision training: use float16 to speed up training on GPUs (up to roughly 2-3× on hardware with fast float16 support)
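The cosine annealing tip can be made concrete with the schedule's closed form, lr(t) = lr_min + 0.5 × (lr_max - lr_min) × (1 + cos(π t / T)). The sketch below evaluates it in plain Python next to a simple step decay for comparison. The function names, the step-decay settings (drop by 10× every 10 epochs), and the 20-epoch horizon are assumptions chosen for illustration.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """LR after `step` scheduler steps on a half-cosine from lr_max to lr_min."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)


def step_decay_lr(step, lr_max, drop=0.1, every=10):
    """Multiply the LR by `drop` once every `every` steps."""
    return lr_max * drop ** (step // every)


total_epochs = 20
schedule = [cosine_annealing_lr(t, total_epochs, 1e-3) for t in range(total_epochs + 1)]

for t in (0, 5, 10, 15, 20):
    print(f"epoch {t:2d}: cosine={schedule[t]:.6f}  step={step_decay_lr(t, 1e-3):.6f}")
```

The cosine schedule starts at the full learning rate, reaches half of it at the midpoint, and approaches zero smoothly at the end, whereas step decay changes the learning rate in abrupt jumps.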
10. Quiz: Test Your Understanding!
After reading the tutorial above, answer the following 5 questions to test your understanding of CNNs: