CNN: Convolutional Neural Networks

A complete CNN tutorial: convolution layers, pooling, padding and stride, popular architectures (LeNet, AlexNet, VGG, ResNet), image classification, and an implementation in PyTorch.

1. Introduction to CNNs

A Convolutional Neural Network (CNN) is a type of neural network designed for grid-structured data, most notably images (a 2D grid of pixels). CNNs transformed computer vision and remain the foundation of most modern image recognition systems.

Why a CNN Instead of an Ordinary Neural Network?

Suppose we feed a 224×224 RGB image (3 channels) to an ordinary fully connected (dense) network. The input layer alone has 224 × 224 × 3 = 150,528 neurons. If the first hidden layer has 1,000 neurons, that single layer needs 150,528 × 1,000 ≈ 150 million weights. This is impractical.

Diagram: Why CNNs Are More Efficient
FULLY CONNECTED (FC) NETWORK:          CNN (CONVOLUTION):

Input: 224×224×3 = 150,528 neurons     Input: 224×224×3

┌─────────────────────────┐            ┌─────────────────────────┐
│ 150,528 ──► 1000        │            │ Conv 3×3 filter         │
│ Parameters: 150 million │            │ Parameters: only 27     │
│ (far too many!)         │            │ per filter (3×3×3)      │
└─────────────────────────┘            └─────────────────────────┘

Problems with FC:                      Advantages of CNNs:
• Too many parameters                  • Parameter sharing
• Severe overfitting                   • Local connectivity
• Ignores spatial structure            • Translation equivariance
• Input must have a fixed size         • Far fewer parameters

CNNs take inspiration from the visual system:
• Local neurons: each one "sees" only a small area (its receptive field)
• The same feature is useful anywhere in the image → parameter sharing
• Hierarchical features: edge → texture → parts → object

Applications of CNNs

| Application | Example | Why a CNN? |
|---|---|---|
| Image Classification | Labeling photos (cat/dog/car) | Detects hierarchical visual patterns |
| Object Detection | YOLO, Faster R-CNN | Detects and localizes objects in an image |
| Semantic Segmentation | U-Net (medical imaging) | Classifies every pixel |
| Face Recognition | FaceNet, DeepFace | Extracts facial features |
| Self-driving Cars | Tesla, Waymo | Detects signs, pedestrians, lanes |
| Medical Imaging | Tumor and cancer detection | Finds abnormal patterns in X-ray/MRI |
| Text Classification | Sentiment analysis | 1D CNN over text |
| Video Analysis | Action recognition | 3D CNN over video frames |

2. Convolution Layer

The convolution layer is the main building block of a CNN. It uses a filter (also called a kernel; its entries are the learned weights) that slides across the input image to detect particular features.

How Convolution Works

Diagram: 3×3 Convolution Operation
INPUT (5×5):                FILTER/Kernel (3×3):
┌───┬───┬───┬───┬───┐      ┌────┬────┬────┐
│ 1 │ 2 │ 0 │ 1 │ 3 │      │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤      ├────┼────┼────┤
│ 0 │ 1 │ 2 │ 3 │ 1 │      │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤      ├────┼────┼────┤
│ 2 │ 3 │ 1 │ 0 │ 2 │      │  1 │  0 │ -1 │
├───┼───┼───┼───┼───┤      └────┴────┴────┘
│ 1 │ 0 │ 3 │ 2 │ 1 │
├───┼───┼───┼───┼───┤      ← Vertical edge detector
│ 3 │ 2 │ 1 │ 0 │ 3 │      (responds to left-right intensity changes)
└───┴───┴───┴───┴───┘

CONVOLUTION OPERATION (top-left position):
┌───┬───┬───┐     ┌────┬────┬────┐
│ 1 │ 2 │ 0 │  ×  │  1 │  0 │ -1 │
├───┼───┼───┤     ├────┼────┼────┤
│ 0 │ 1 │ 2 │  ×  │  1 │  0 │ -1 │
├───┼───┼───┤     ├────┼────┼────┤
│ 2 │ 3 │ 1 │  ×  │  1 │  0 │ -1 │
└───┴───┴───┘     └────┴────┴────┘

= (1×1 + 2×0 + 0×(-1)) + (0×1 + 1×0 + 2×(-1)) + (2×1 + 3×0 + 1×(-1))
= (1 + 0 + 0) + (0 + 0 - 2) + (2 + 0 - 1)
= 1 + (-2) + 1 = 0

→ Slide the filter one step to the right → compute again
→ Repeat until every position is covered

Output FEATURE MAP (3×3):
┌──────┬──────┬──────┐
│   0  │   2  │  -3  │
├──────┼──────┼──────┤
│  -3  │  -1  │   2  │  ← large |value| = strong vertical edge
├──────┼──────┼──────┤
│   1  │   3  │  -1  │
└──────┴──────┴──────┘

Note: the NumPy example below uses the mirrored kernel [-1, 0, 1], so its output has the opposite sign.

Types of Filters/Kernels

A CNN learns automatically which filters are most useful. For intuition, here are some well-known hand-crafted filters:

| Filter | Kernel | Function |
|---|---|---|
| Vertical Edge | [[-1,0,1],[-1,0,1],[-1,0,1]] | Detects vertical edges |
| Horizontal Edge | [[-1,-1,-1],[0,0,0],[1,1,1]] | Detects horizontal edges |
| Sharpen | [[0,-1,0],[-1,5,-1],[0,-1,0]] | Sharpens the image |
| Blur (Gaussian) | [[1,2,1],[2,4,2],[1,2,1]]/16 | Blurs the image (reduces noise) |
| Emboss | [[-2,-1,0],[-1,1,1],[0,1,2]] | Gives a 3D (embossed) effect |
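One way to read the kernels in this table is to look at the sum of their entries. Edge detectors sum to 0, so they give no response on a perfectly flat region, while the sharpen, blur, and emboss kernels sum to 1, so they keep the overall brightness of a flat region unchanged. The sketch below checks this with exact fractions; the dictionary layout and the `kernel_sum` helper are illustrative assumptions.

```python
from fractions import Fraction

# Each entry: (kernel rows, scale factor applied to every element)
kernels = {
    "vertical_edge":   ([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], Fraction(1)),
    "horizontal_edge": ([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], Fraction(1)),
    "sharpen":         ([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], Fraction(1)),
    "gaussian_blur":   ([[1, 2, 1], [2, 4, 2], [1, 2, 1]], Fraction(1, 16)),
    "emboss":          ([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], Fraction(1)),
}


def kernel_sum(rows, scale):
    """Exact sum of all kernel entries after scaling."""
    return sum(Fraction(v) for row in rows for v in row) * scale


sums = {name: kernel_sum(rows, scale) for name, (rows, scale) in kernels.items()}
for name, total in sums.items():
    print(f"{name:<16} sum = {total}")
```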
Python: Manual Convolution with NumPy
import numpy as np

def conv2d_manual(image, kernel):
    """Manual 2D convolution (valid mode, stride 1).

    Like deep learning frameworks, this does not flip the kernel,
    so strictly speaking it is cross-correlation.
    """
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = h - kh + 1
    out_w = w - kw + 1
    output = np.zeros((out_h, out_w))

    for i in range(out_h):
        for j in range(out_w):
            # Take the patch of the image under the kernel
            patch = image[i:i+kh, j:j+kw]
            # Element-wise multiplication, then sum
            output[i, j] = np.sum(patch * kernel)

    return output

# Example input
image = np.array([
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [2, 3, 1, 0, 2],
    [1, 0, 3, 2, 1],
    [3, 2, 1, 0, 3]
], dtype=float)

# Vertical edge detector
kernel_vert = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
], dtype=float)

# Horizontal edge detector
kernel_horiz = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1]
], dtype=float)

print("Input Image:")
print(image)
print(f"\nVertical Edge Kernel:")
print(kernel_vert)

output_vert = conv2d_manual(image, kernel_vert)
output_horiz = conv2d_manual(image, kernel_horiz)

print(f"\nOutput (Vertical Edge):")
print(output_vert)
print(f"\nOutput (Horizontal Edge):")
print(output_horiz)

# Output dimensions
h, w = image.shape
kh, kw = kernel_vert.shape
print(f"\nInput size: {h}×{w}")
print(f"Kernel size: {kh}×{kw}")
print(f"Output size: ({h}-{kh}+1)×({w}-{kw}+1) = {h-kh+1}×{w-kw+1}")
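The loop-based `conv2d_manual` above is easy to follow but slow on large images. As a sketch of one possible alternative, the same valid cross-correlation can be written without Python loops using `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later). The function name `conv2d_vectorized` is an assumption made for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d_vectorized(image, kernel):
    """Valid cross-correlation (no kernel flip) without explicit Python loops."""
    # windows has shape (out_h, out_w, kh, kw): one kernel-sized patch per output position
    windows = sliding_window_view(image, kernel.shape)
    # Multiply every patch by the kernel and sum over the two patch axes
    return np.einsum("ijkl,kl->ij", windows, kernel)


image = np.array([
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [2, 3, 1, 0, 2],
    [1, 0, 3, 2, 1],
    [3, 2, 1, 0, 3],
], dtype=float)

kernel_vert = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

result = conv2d_vectorized(image, kernel_vert)
print(result)
```

For this input and kernel, the result should match what the loop version produces, entry for entry.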

3. Padding, Stride & Output Size

Padding

Padding adds zeros (or other values) around the input before the convolution. It matters because, without padding, every convolution shrinks the output, and pixels near the image border contribute to fewer output positions, so information at the edges is lost faster.

Diagram: Padding = 1
ORIGINAL INPUT (3×3):           PADDED INPUT (5×5):
┌───┬───┬───┐                   ┌───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │                   │ 0 │ 0 │ 0 │ 0 │ 0 │
├───┼───┼───┤                   ├───┼───┼───┼───┼───┤
│ 4 │ 5 │ 6 │       ───►        │ 0 │ 1 │ 2 │ 3 │ 0 │
├───┼───┼───┤                   ├───┼───┼───┼───┼───┤
│ 7 │ 8 │ 9 │                   │ 0 │ 4 │ 5 │ 6 │ 0 │
└───┴───┴───┘                   ├───┼───┼───┼───┼───┤
                                │ 0 │ 7 │ 8 │ 9 │ 0 │
                                ├───┼───┼───┼───┼───┤
                                │ 0 │ 0 │ 0 │ 0 │ 0 │
                                └───┴───┴───┴───┴───┘

Stride

Stride determines how many pixels the filter moves at each step. Stride 1 moves one pixel at a time; stride 2 moves two pixels, which roughly halves the output width and height.
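To see padding and stride working together, here is a minimal NumPy sketch that zero-pads the input with `np.pad` and then samples the filter positions every `stride` pixels. The function name `conv2d_pad_stride` and the all-ones kernel are assumptions chosen for illustration; the 3×3 input is the one from the padding diagram.

```python
import numpy as np


def conv2d_pad_stride(image, kernel, padding=0, stride=1):
    """Cross-correlation on a zero-padded image, evaluated every `stride` pixels."""
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = padded[top:top + kh, left:left + kw]
            output[i, j] = np.sum(patch * kernel)
    return output


image = np.arange(1, 10, dtype=float).reshape(3, 3)  # values 1..9, as in the diagram
kernel = np.ones((3, 3))                             # hypothetical kernel: sums each patch

same = conv2d_pad_stride(image, kernel, padding=1, stride=1)
strided = conv2d_pad_stride(image, kernel, padding=1, stride=2)

print("padding=1, stride=1:", same.shape)
print("padding=1, stride=2:", strided.shape)
print(strided)
```

With padding 1 the 3×3 input keeps its size at stride 1, and the stride-2 result is simply every second row and column of the stride-1 result.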

Output Size Formula

Output Size Calculation

Output Size = floor((Input Size - Kernel Size + 2 × Padding) / Stride) + 1

Example: Input 32×32, Kernel 5×5, Padding 2, Stride 1:

Output = (32 - 5 + 2×2) / 1 + 1 = 32 → the output is 32×32 (unchanged). This is called "same" padding.

| Configuration | Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|---|
| Valid (no padding) | 32×32 | 3×3 | 0 | 1 | 30×30 |
| Same padding | 32×32 | 3×3 | 1 | 1 | 32×32 |
| Downsample 2× | 32×32 | 3×3 | 1 | 2 | 16×16 |
| Large kernel | 28×28 | 5×5 | 0 | 1 | 24×24 |
| Same large kernel | 28×28 | 5×5 | 2 | 1 | 28×28 |
Python: Computing the Output Size
def calc_output_size(input_size, kernel_size, padding=0, stride=1):
    """Compute the output size of a convolution (or pooling) layer."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Test several configurations
configs = [
    (32, 3, 0, 1, "Valid, stride=1"),
    (32, 3, 1, 1, "Same, stride=1"),
    (32, 3, 1, 2, "Downsample 2x"),
    (28, 5, 0, 1, "5x5 kernel, no pad"),
    (28, 5, 2, 1, "5x5 kernel, same pad"),
    (224, 7, 3, 2, "ResNet first layer"),
]

print("Input  Kernel  Pad  Stride  Output  Description")
print("-" * 65)
for inp, ker, pad, stride, desc in configs:
    out = calc_output_size(inp, ker, pad, stride)
    print(f"  {inp:3d}    {ker:2d}x{ker:<2d}  {pad:2d}    {stride:2d}     {out:3d}x{out:<3d}   {desc}")

# Multi-layer CNN output size tracker
print("\n=== Simple CNN Architecture ===")
# Each entry: (name, kernel, padding, stride)
layers = [
    ("Conv1", 3, 1, 1),
    ("Conv2", 3, 1, 1),
    ("Pool1", 2, 0, 2),  # MaxPool 2x2: 32 -> 16
    ("Conv3", 3, 1, 1),
    ("Conv4", 3, 1, 1),
    ("Pool2", 2, 0, 2),  # MaxPool 2x2: 16 -> 8
]

size = 32  # 32x32 input, matching the sizes the layers above expect
print(f"Input: {size}x{size}")
for name, ker, pad, stride in layers:
    size = calc_output_size(size, ker, pad, stride)
    print(f"  {name}: kernel={ker}x{ker}, pad={pad}, stride={stride} → {size}x{size}")
channels = 16  # assumed number of output channels of the last conv layer
print(f"Flatten output: {size}×{size}×{channels} = {size*size*channels}")

4. Pooling Layer

A pooling layer reduces the spatial size (width × height) of the feature maps and makes the features more robust to small translations of the input. Pooling also reduces computation and the risk of overfitting.

Types of Pooling

Diagram: Max Pooling vs Average Pooling
MAX POOLING 2×2 (stride=2):        AVG POOLING 2×2 (stride=2):

Input 4×4:                          Input 4×4:
┌───┬───┬───┬───┐                   ┌───┬───┬───┬───┐
│ 1 │ 3 │ 2 │ 1 │                   │ 1 │ 3 │ 2 │ 1 │
├───┼───┼───┼───┤                   ├───┼───┼───┼───┤
│ 5 │ 6 │ 1 │ 0 │                   │ 5 │ 6 │ 1 │ 0 │
├───┼───┼───┼───┤                   ├───┼───┼───┼───┤
│ 2 │ 4 │ 8 │ 7 │                   │ 2 │ 4 │ 8 │ 7 │
├───┼───┼───┼───┤                   ├───┼───┼───┼───┤
│ 1 │ 3 │ 2 │ 5 │                   │ 1 │ 3 │ 2 │ 5 │
└───┴───┴───┴───┘                   └───┴───┴───┴───┘

Output 2×2:                         Output 2×2:
┌───┬───┐                           ┌──────┬──────┐
│ 6 │ 2 │  ← max of                 │ 3.75 │ 1.00 │  ← mean of
├───┼───┤    each 2×2               ├──────┼──────┤    each 2×2
│ 4 │ 8 │    region                 │ 2.50 │ 5.50 │    region
└───┴───┘                           └──────┴──────┘

Max Pooling: take the MAXIMUM        Avg Pooling: take the MEAN
→ Most commonly used                 → Sometimes used at the
→ Keeps the most prominent             end of the network
  features (edges, textures)

| Pooling Type | Operation | Advantage | Typical Use |
|---|---|---|---|
| Max Pooling | Takes the maximum of each region | Keeps the strongest activations | Most common (default) |
| Average Pooling | Takes the mean of each region | Smoother; every value contributes | Global Average Pooling (end of the network) |
| Global Average Pooling | Averages the ENTIRE feature map → 1 value per channel | Greatly reduces parameters | End of the CNN, before the classifier |
| Stochastic Pooling | Samples randomly according to the activation distribution | Regularization | Rarely used |
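The pooling operations in the table can be sketched in NumPy by reshaping the feature map into non-overlapping blocks and reducing over each block. This is a simplified illustration that assumes stride equals the window size and that the input sides are divisible by it; the function name `pool2d` is hypothetical. The input is the 4×4 example from the diagram.

```python
import numpy as np


def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling (stride == size) on a 2D array."""
    h, w = x.shape
    # blocks[i, a, j, b] == x[i*size + a, j*size + b]
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


x = np.array([
    [1, 3, 2, 1],
    [5, 6, 1, 0],
    [2, 4, 8, 7],
    [1, 3, 2, 5],
], dtype=float)

max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="avg")
gap = x.mean()  # global average pooling of one channel: a single number

print("Max pooling:\n", max_out)
print("Average pooling:\n", avg_out)
print("Global average pooling:", gap)
```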

5. A Complete CNN Architecture

A complete CNN architecture consists of several components that work together:

General Structure of a CNN

Diagram: Complete CNN Architecture
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐
│  INPUT  │──►│  CONV   │──►│  POOL   │──►│ FLATTEN  │──►│  DENSE   │──►│ OUTPUT  │
│  IMAGE  │   │ LAYERS  │   │ LAYERS  │   │          │   │  LAYERS  │   │ SOFTMAX │
│ 28×28×1 │   │         │   │         │   │          │   │          │   │         │
└─────────┘   └─────────┘   └─────────┘   └──────────┘   └──────────┘   └─────────┘

CONV + Pool Blocks (Feature Extractor)  |  Dense Layers (Classifier)

Stage 1: Conv → ReLU → Pool             FC Layer 1: 128 neurons + ReLU
  28×28 → Conv 3×3 (32 filters) → Pool  FC Layer 2: 64 neurons + ReLU
  → Output: 14×14×32                    Output: 10 neurons (softmax)

Stage 2: Conv → ReLU → Pool
  14×14 → Conv 3×3 (64 filters) → Pool
  → Output: 7×7×64

Stage 3: Conv → ReLU
  7×7 → Conv 3×3 (128 filters)
  → Output: 7×7×128

Global Average Pooling
  7×7×128 → 1×1×128 → Flatten → 128
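To make the diagram concrete, the sketch below tallies the output shape and parameter count of each stage. It assumes that every convolution uses padding 1 (so the spatial size only changes at the pooling steps, as the diagram implies) and that every layer has a bias term; the helper names are illustrative.

```python
def conv_params(kernel, c_in, c_out):
    """Weights (kernel * kernel * c_in per filter) plus one bias per filter."""
    return (kernel * kernel * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return (n_in + 1) * n_out


# (layer, output shape, parameters); conv layers assume padding=1
layers = [
    ("Conv 3x3, 32 filters + ReLU",  "28x28x32", conv_params(3, 1, 32)),
    ("MaxPool 2x2",                  "14x14x32", 0),
    ("Conv 3x3, 64 filters + ReLU",  "14x14x64", conv_params(3, 32, 64)),
    ("MaxPool 2x2",                  "7x7x64",   0),
    ("Conv 3x3, 128 filters + ReLU", "7x7x128",  conv_params(3, 64, 128)),
    ("Global Average Pooling",       "128",      0),
    ("FC 128 + ReLU",                "128",      dense_params(128, 128)),
    ("FC 64 + ReLU",                 "64",       dense_params(128, 64)),
    ("Output 10 (softmax)",          "10",       dense_params(64, 10)),
]

total_params = sum(count for _, _, count in layers)
for name, shape, count in layers:
    print(f"{name:<30} {shape:<10} {count:>7,}")
print(f"Total parameters: {total_params:,}")
```

Pooling and global average pooling contribute no parameters, and most of the weights sit in the last convolution stage, where both the input and output channel counts are large.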

Activation: ReLU

ReLU (Rectified Linear Unit) is the most commonly used activation function in CNNs. Its formula is f(x) = max(0, x): negative values become 0 and positive values pass through unchanged.

Python: Common Activation Functions
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)

# Activation functions
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x * sigmoid(x)

activations = {
    'Sigmoid': sigmoid(x),
    'Tanh': tanh(x),
    'ReLU': relu(x),
    'Leaky ReLU (α=0.01)': leaky_relu(x),
    'Swish (SiLU)': swish(x)
}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, (name, values) in zip(axes.ravel(), activations.items()):
    ax.plot(x, values, 'b-', linewidth=2)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.set_title(name, fontsize=12)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-2, 5)

# Hide last subplot
axes[1, 2].axis('off')
plt.suptitle('Neural Network Activation Functions', fontsize=14)
plt.tight_layout()
plt.show()

# ReLU vs Sigmoid: gradient comparison
def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)

# Scalars are used here because 1-element arrays cannot be formatted with ':.4f'
print("=== Gradient Comparison ===")
print(f"Sigmoid gradient max (x=0): {sigmoid_grad(0.0):.4f}")
print(f"Saturated sigmoid (x=5): {sigmoid_grad(5.0):.6f}")
print(f"ReLU gradient (x=5): {1.0:.4f}")
print(f"ReLU gradient (x=-5): {0.0:.4f}")
print("\n→ Sigmoid gradients shrink toward 0 when saturated (vanishing gradient)")
print("→ ReLU: the gradient is 1 for x > 0, so it does not shrink (it is 0 for x < 0)")
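The vanishing-gradient remark printed above becomes clearer when the derivative is multiplied across several layers, as backpropagation does. The sketch below is a deliberate simplification: it ignores the weight matrices and assumes the best case for sigmoid, where every pre-activation is 0 and the derivative is at its maximum of 0.25.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid at x."""
    s = sigmoid(x)
    return s * (1.0 - s)


depths = (1, 5, 10, 20)
# Product of per-layer derivatives along one path through the network
sigmoid_chain = {d: sigmoid_grad(0.0) ** d for d in depths}
relu_chain = {d: 1.0 ** d for d in depths}  # ReLU derivative is 1 on an active (x > 0) path

for d in depths:
    print(f"depth {d:2d}: sigmoid {sigmoid_chain[d]:.3e} | ReLU {relu_chain[d]:.1f}")
```

Even in this best case the sigmoid factor falls below one millionth after ten layers, while an active ReLU path passes the gradient through unchanged.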

6. Popular Architectures

Evolution of CNN Architectures

Diagram: Evolution of CNN Architectures
Timeline of CNN architectures:

1998: LeNet-5 (Yann LeCun)
  │   • 5 layers, 60K parameters
  │   • Handwritten digit recognition (MNIST)
  │
2012: AlexNet (Krizhevsky)
  │   • 8 layers, 60M parameters
  │   • ImageNet winner → sparked the deep learning revolution
  │   • ReLU, Dropout, GPU training
  │
2014: VGGNet (Simonyan & Zisserman)
  │   • 16-19 layers, 138M parameters (VGG-16)
  │   • Consistent 3×3 kernels, very deep
  │
2014: GoogLeNet/Inception (Szegedy)
  │   • 22 layers, 6.8M parameters
  │   • Inception module (parallel convolutions)
  │   • 1×1 convolutions to reduce parameters
  │
2015: ResNet (He et al.) ⭐ BREAKTHROUGH
  │   • Up to 152 layers (ResNet-50: 25.6M parameters; ResNet-152: about 60M)
  │   • Residual connections (skip connections)
  │   • Made very deep networks trainable (mitigates vanishing gradients)
  │   • Ensemble top-5 error of 3.57% on ImageNet, below the roughly 5% estimated for humans
  │
2017: DenseNet (Huang et al.)
  │   • Dense connections (each layer feeds all later layers in its block)
  │   • Feature reuse → parameter efficient
  │
2019: EfficientNet (Tan & Le)
  │   • Compound scaling (width, depth, resolution)
  │   • State-of-the-art accuracy at the time with far fewer parameters
  │
2020+: Vision Transformer (ViT)
      • Replaces convolutions with a Transformer
      • Self-attention over image patches

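Architecture Comparison

| Architecture | Year | Layers | Parameters | Top-5 Acc (ImageNet) | Key Innovation |
|---|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | 60K | n/a | One of the first successful CNNs |
| AlexNet | 2012 | 8 | 60M | 84.7% | ReLU, Dropout, GPU |
| VGG-16 | 2014 | 16 | 138M | 92.7% | 3×3 kernels, simplicity |
| GoogLeNet | 2014 | 22 | 6.8M | 93.3% | Inception module |
| ResNet-50 | 2015 | 50 | 25.6M | about 93% single-crop (96.4% is the ResNet ensemble result) | Skip connections |
| EfficientNet-B0 | 2019 | n/a | 5.3M | 93.3% (the larger B7 variant reaches about 97%) | Compound scaling |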
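The skip connection listed as ResNet's key innovation can be illustrated in a few lines of NumPy. This is a simplified sketch, not the actual ResNet block: real blocks use convolutions and batch normalization, whereas here two small dense matrices (`w1`, `w2`, both hypothetical) stand in for the residual branch F(x).

```python
import numpy as np


def relu(x):
    return np.maximum(0, x)


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is a two-layer transform standing in for convolutions."""
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(fx + x)      # skip connection adds the input back


x = np.array([[1.0, 2.0, 3.0]])

# With all-zero weights F(x) = 0, so the block simply passes x through
zeros = np.zeros((3, 3))
out_identity = residual_block(x, zeros, zeros)

# With random weights the block still preserves the input shape
rng = np.random.default_rng(0)
out_random = residual_block(x, rng.normal(size=(3, 3)), rng.normal(size=(3, 3)))

print("Zero weights:  ", out_identity)
print("Random weights:", out_random)
```

Because the block only has to learn a correction on top of the identity, adding more blocks does not force the signal (or its gradient) through an extra transformation, which is one intuition for why very deep residual networks remain trainable.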

7. Implementing a CNN in PyTorch

Now we implement a complete CNN for image classification with PyTorch on the CIFAR-10 dataset (10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck).

Python: CNN with PyTorch (CIFAR-10)
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

# =============================================
# 1. DEVICE & HYPERPARAMETERS
# =============================================
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

BATCH_SIZE = 64
LEARNING_RATE = 0.001
NUM_EPOCHS = 20
NUM_CLASSES = 10

# =============================================
# 2. DATA LOADING & AUGMENTATION
# =============================================
transform_train = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(10),
    transforms.RandomAffine(0, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.4914, 0.4822, 0.4465],
        std=[0.2470, 0.2435, 0.2616]
    )
])

# Download CIFAR-10
train_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform_train
)
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform_test
)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, 
                          shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                         shuffle=False, num_workers=2)

classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# =============================================
# 3. MODEL DEFINITION
# =============================================
class CNNCIFAR10(nn.Module):
    def __init__(self, num_classes=10):
        super(CNNCIFAR10, self).__init__()
        
        # Block 1: Conv → BN → ReLU → Conv → BN → ReLU → MaxPool
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),         # 32×32 → 16×16
            nn.Dropout2d(0.25)
        )
        
        # Block 2
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),         # 16×16 → 8×8
            nn.Dropout2d(0.25)
        )
        
        # Block 3
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),         # 8×8 → 4×4
            nn.Dropout2d(0.25)
        )
        
        # Global Average Pooling + Classifier
        self.global_pool = nn.AdaptiveAvgPool2d(1)  # 4×4 → 1×1
        
        self.classifier = nn.Sequential(
            nn.Linear(128, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        # Shapes in the comments omit the batch dimension
        x = self.block1(x)    # → (32, 16, 16)
        x = self.block2(x)    # → (64, 8, 8)
        x = self.block3(x)    # → (128, 4, 4)
        x = self.global_pool(x)  # → (128, 1, 1)
        x = x.view(x.size(0), -1)  # → (128,)
        x = self.classifier(x)
        return x

model = CNNCIFAR10(NUM_CLASSES).to(device)

# Print model summary
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nModel Parameters: {total_params:,} total, {trainable_params:,} trainable")

# =============================================
# 4. TRAINING LOOP
# =============================================
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCHS)

train_losses = []
train_accs = []
test_accs = []

for epoch in range(NUM_EPOCHS):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels.size(0)
        correct += predicted.eq(labels).sum().item()
    
    scheduler.step()
    
    train_loss = running_loss / len(train_loader)
    train_acc = 100. * correct / total
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    
    # Test evaluation
    model.eval()
    test_correct = 0
    test_total = 0
    
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = outputs.max(1)
            test_total += labels.size(0)
            test_correct += predicted.eq(labels).sum().item()
    
    test_acc = 100. * test_correct / test_total
    test_accs.append(test_acc)
    
    print(f"Epoch [{epoch+1:2d}/{NUM_EPOCHS}] "
          f"Loss: {train_loss:.4f} | "
          f"Train Acc: {train_acc:.2f}% | "
          f"Test Acc: {test_acc:.2f}%")

# =============================================
# 5. VISUALIZATION
# =============================================
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Loss curve
axes[0].plot(train_losses, 'b-', linewidth=2)
axes[0].set_title('Training Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].grid(True, alpha=0.3)

# Accuracy curve
axes[1].plot(train_accs, 'b-', linewidth=2, label='Train')
axes[1].plot(test_accs, 'r-', linewidth=2, label='Test')
axes[1].set_title('Accuracy')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy (%)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Sample predictions: the first 8 test images shown side by side
images, labels = next(iter(test_loader))
model.eval()
with torch.no_grad():
    outputs = model(images[:8].to(device))
predicted = outputs.argmax(dim=1).cpu().tolist()

mean = torch.tensor([0.4914, 0.4822, 0.4465])
std = torch.tensor([0.2470, 0.2435, 0.2616])

# Undo the normalization: (N, C, H, W) → (N, H, W, C), then x * std + mean
imgs = (images[:8].permute(0, 2, 3, 1) * std + mean).clamp(0, 1)
strip = torch.cat(list(imgs), dim=1)  # one 32×256 strip of 8 images
axes[2].imshow(strip.numpy())
axes[2].set_title('Pred: ' + ', '.join(classes[p] for p in predicted), fontsize=8)
axes[2].axis('off')

plt.suptitle(f'CNN CIFAR-10: Final Test Acc: {test_accs[-1]:.2f}%', fontsize=14)
plt.tight_layout()
plt.show()

8. Transfer Learning

Transfer learning takes a model that has already been trained on a large dataset (such as ImageNet) and adapts it to a new task. It is effective because the low-level features (edges, textures) learned by the pre-trained model are generally useful across vision tasks.

Transfer Learning Strategies

Diagram: Transfer Learning Strategies
PRE-TRAINED MODEL (ImageNet):
┌──────────────────────────────────────────────────────┐
│ Feature Extractor (Conv layers)                      │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐      │
│ │ Edge    │ │ Texture │ │ Pattern │ │ Object  │      │
│ │ Detector│ │ Detector│ │ Detector│ │ Parts   │      │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘      │
├──────────────────────────────────────────────────────┤
│ Classifier Head (FC layers)                          │
│ [1000 classes: ImageNet]                             │
└──────────────────────────────────────────────────────┘

STRATEGY 1: Feature Extraction (freeze the conv layers, train the classifier)
┌──────────────────────────────────────────────────────┐
│ ❄️ Frozen Feature Extractor (not trained)            │
├──────────────────────────────────────────────────────┤
│ 🔥 New Classifier Head [10 classes]                  │
│    (trained from scratch)                            │
└──────────────────────────────────────────────────────┘
→ Best for: small datasets, tasks similar to ImageNet

STRATEGY 2: Fine-Tuning (unfreeze the later conv layers, train them with the new head)
┌──────────────────────────────────────────────────────┐
│ ❄️ Frozen Early Layers (edge, texture)               │
├──────────────────────────────────────────────────────┤
│ 🔥 Unfrozen Later Layers (object parts)              │
├──────────────────────────────────────────────────────┤
│ 🔥 New Classifier Head [10 classes]                  │
└──────────────────────────────────────────────────────┘
→ Best for: medium-sized datasets, tasks that differ from ImageNet
Python: Transfer Learning with ResNet
import torch
import torch.nn as nn
import torchvision.models as models

# =============================================
# TRANSFER LEARNING: Pre-trained ResNet-18
# =============================================

# Load pre-trained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

print("=== Original ResNet-18 ===")
print(f"Total params: {sum(p.numel() for p in model.parameters()):,}")

# =============================================
# STRATEGY 1: Feature Extraction
# =============================================
# Freeze ALL convolution layers
for param in model.parameters():
    param.requires_grad = False

# Replace classifier (final FC layer)
# ResNet-18 final layer: model.fc (512 β†’ 1000)
model.fc = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)  # 10 classes for CIFAR-10
)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\n=== After Feature Extraction ===")
print(f"Total params: {total_params:,}")
print(f"Trainable params: {trainable_params:,}")
print(f"Frozen params: {total_params - trainable_params:,}")

# =============================================
# STRATEGY 2: Fine-Tuning
# =============================================
model_ft = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the stem and the early layers (conv1, bn1, layer1, layer2).
# startswith avoids matching nested names such as layer3.0.conv1
for name, param in model_ft.named_parameters():
    if name.startswith(('conv1', 'bn1', 'layer1', 'layer2')):
        param.requires_grad = False

# The later layers (layer3, layer4) keep requires_grad=True, the default

# Replace FC (the new layers are trainable by default)
model_ft.fc = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 10)
)

trainable_ft = sum(p.numel() for p in model_ft.parameters() if p.requires_grad)
print(f"\n=== After Fine-Tuning ===")
print(f"Trainable params: {trainable_ft:,}")
print(f"Frozen early layers: conv1, bn1, layer1, layer2")
print(f"Unfrozen: layer3, layer4, fc")

# =============================================
# DISCRIMINATIVE LEARNING RATES
# =============================================
# Give each group of trainable layers its own learning rate.
# Frozen parameters are left out: they receive no gradients.
param_groups = [
    {'params': [p for n, p in model_ft.named_parameters()
                if n.startswith(('layer3', 'layer4')) and p.requires_grad],
     'lr': 1e-4},          # Pre-trained later layers: small LR
    {'params': model_ft.fc.parameters(),
     'lr': 1e-3}           # New head: larger LR
]

optimizer_ft = torch.optim.Adam(param_groups, weight_decay=1e-4)
print("\n=== Discriminative Learning Rates ===")
print("  layer3, layer4: 1e-4")
print("  FC head:        1e-3")

9. Practical Tips and Tricks

💡 Best Practices for CNNs
  • Use data augmentation: random flips, rotations, crops, and color jitter usually improve generalization noticeably.
  • Use Batch Normalization: place it after each conv layer. It speeds up training and acts as a mild regularizer.
  • Start with transfer learning: avoid training from scratch unless the dataset is very unusual (for example, medical imaging).
  • Learning rate finder: start from a small LR and increase it until the loss blows up, then pick an LR about 10× smaller than the one where it diverged.
  • Cosine annealing: an LR scheduler that lowers the LR along a cosine curve. It often works better than step decay.
  • Global Average Pooling: replace the large FC layers at the end with GAP to cut the parameter count drastically.
  • Monitor overfitting: if training accuracy keeps rising while test accuracy stalls, you have too little data or a model that is too complex.
  • Mixed precision training: use float16 to speed up training on GPUs (often cited as a 2-3× speedup on supported hardware).
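As an illustration of the cosine annealing tip, the schedule can be written as a closed-form function of the epoch. The sketch below assumes the standard form without restarts, which is also the closed form documented for PyTorch's `CosineAnnealingLR`; the function name `cosine_annealing_lr` is hypothetical, and the values (20 epochs, starting LR 1e-3) mirror the training script above.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate after `epoch` steps, following half a cosine period from lr_max to lr_min."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cosine)


schedule = [cosine_annealing_lr(e, total_epochs=20, lr_max=1e-3) for e in range(21)]

for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.6f}")
```

The LR changes slowly at the start and end of training and fastest in the middle, in contrast to step decay, which drops it abruptly at fixed epochs.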

10. Quiz: Test Your Understanding!

After reading the tutorial above, answer the following 5 questions to test your understanding of CNNs:

Question 1: Why is a CNN more efficient than a fully connected network for processing images?

a) CNNs use faster GPUs
b) CNNs use parameter sharing and local connectivity
c) CNNs discard all information in the image
d) CNNs do not need training

Question 2: What is the main function of a pooling layer?

a) Adding parameters to the model
b) Reducing the spatial size and making features more robust
c) Adding channels to the feature map
d) Combining features from different layers

Question 3: If the input is 32×32, the kernel 3×3, the padding 1, and the stride 2, what is the output size?

a) 32×32
b) 30×30
c) 16×16
d) 15×15

Question 4: The key innovation of ResNet is...

a) Using larger filters (7×7)
b) Removing all pooling layers
c) Residual (skip) connections that mitigate vanishing gradients
d) Using sigmoid as the activation

Question 5: Transfer learning is most suitable when...

a) The dataset is very large (millions of images)
b) The dataset is small and the task is similar to the pre-training task
c) The model must be very fast at inference
d) No GPU is available for training
πŸ” Zoom
100%
🎨 Tema