
Ensemble Methods: Bagging, Boosting & Model Combination


A complete guide to ensemble methods: bagging vs. boosting, Random Forest, AdaBoost, Gradient Boosting, XGBoost, Voting, and Stacking, with Python implementations.

1. Introduction to Ensemble Methods

Ensemble methods are machine learning techniques that combine several models (called base learners or weak learners) into a single model that is stronger and more accurate. The underlying principle is the "wisdom of the crowd": the collective decision of many models tends to be better than that of any single model.

A simple analogy: if you ask 100 people to estimate the price of a house, the average of their answers is likely to be more accurate than any one individual's guess, even though each person may be wrong.

Why Do Ensembles Work?

Diagram: The Intuition Behind Ensemble Methods
┌──────────────────────────────────────────────────────────────────┐
│                      WHY DO ENSEMBLES WORK?                      │
│                                                                  │
│  Model 1 (Decision Tree):      Prediction: Cat  ✓ (Correct)      │
│  Model 2 (KNN):                Prediction: Cat  ✓ (Correct)      │
│  Model 3 (Logistic Reg):       Prediction: Dog  ✗ (Wrong)        │
│  Model 4 (SVM):                Prediction: Cat  ✓ (Correct)      │
│  Model 5 (Naive Bayes):        Prediction: Dog  ✗ (Wrong)        │
│  ─────────────────────────────────────────────────               │
│  Majority vote:                Prediction: Cat  ✓ (3 vs 2)       │
│                                                                  │
│  Principle: if each model's accuracy is > 50%,                   │
│  their combination tends to be more accurate.                    │
│                                                                  │
│  Condition: the models must be DIVERSE.                          │
│             If all models make the SAME errors,                  │
│             the ensemble does NOT help.                          │
└──────────────────────────────────────────────────────────────────┘
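
The majority-vote principle can be checked numerically. The sketch below rests on two idealized assumptions that real ensembles only approximate: every model is correct with the same probability p, and the models make their errors independently. The function name majority_vote_accuracy is illustrative and does not come from any library.

```python
from math import comb


def majority_vote_accuracy(n_models, p):
    """Probability that a strict majority of n_models is correct.

    Assumes each model is correct with probability p, independently of
    the others, and that n_models is odd so ties cannot occur.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 5, 25, 101):
    acc = majority_vote_accuracy(n, 0.6)
    print(f"{n:3d} models at 60% each -> ensemble accuracy {acc:.3f}")
```

Under these assumptions, five models at 60% accuracy reach roughly 68% as a group, and the gain keeps growing with more models. When p drops below 0.5 the effect reverses, and correlated errors shrink the gain, which is why diversity is listed as a condition.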

Main Categories of Ensemble Methods

| Category | Principle | Example Algorithms | Analogy |
|----------|-----------|--------------------|---------|
| Bagging | Train in parallel on different subsets of the data | Random Forest | An exam with randomized questions, with the answers averaged |
| Boosting | Train sequentially; each model corrects the errors of the previous one | AdaBoost, XGBoost | A student learning from mistakes on the previous exam |
| Stacking | Use model outputs as inputs to a meta-model | Stacked Generalization | A teacher reviewing every student's answers |
| Voting | Combine predictions from different types of models | Voting Classifier | A democratic vote: the majority wins |

2. Bagging (Bootstrap Aggregating)

Bagging (Bootstrap Aggregating) is an ensemble technique introduced by Leo Breiman (1996). It reduces a model's variance by training many models on bootstrapped subsets of the data (sampling with replacement) and combining their predictions.

How Bagging Works

Diagram: The Bagging Process
┌──────────────────────────────────────────────────────────────────┐
│                         BAGGING PROCESS                          │
│                                                                  │
│  Original dataset (N samples)                                    │
│  ┌─────────────────────────────────────┐                         │
│  │ ■ ░ △ ○ ★ ■ ░ △ ○ ★ ■ ░ △ ○ ★       │                         │
│  └─────────────────────────────────────┘                         │
│         │            │            │                              │
│    Bootstrap     Bootstrap     Bootstrap                         │
│    (sampling     (sampling     (sampling                         │
│    + replace)    + replace)    + replace)                        │
│         ▼            ▼            ▼                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                          │
│  │ ■ ░ ■ ○  │ │ ░ △ ░ ★  │ │ ○ ★ △ ○  │   ← Subsets 1, 2, 3      │
│  │ △ ░ ★ ░  │ │ ■ △ ○ ■  │ │ ★ ░ ■ △  │     (with duplicates)    │
│  └──────────┘ └──────────┘ └──────────┘                          │
│       │             │             │                              │
│   Train Model   Train Model   Train Model                        │
│    (Tree 1)      (Tree 2)      (Tree 3)                          │
│       │             │             │                              │
│  Prediction 1  Prediction 2  Prediction 3                        │
│       │             │             │                              │
│       └─────────────┼─────────────┘                              │
│                     ▼                                            │
│        Aggregation (vote / average)                              │
│                     ▼                                            │
│              Final prediction                                    │
└──────────────────────────────────────────────────────────────────┘

Bootstrap:   sampling WITH replacement (duplicate rows are possible)
Aggregation: classification → majority vote
             regression → average
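
The two building blocks of bagging, bootstrap sampling and aggregation, can be sketched with the standard library alone. This is a minimal illustration rather than a full bagging implementation: no models are trained, and the helper names bootstrap_sample, majority_vote, and average are made up for this example.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows WITH replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Aggregate classification predictions: the most common label wins."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Aggregate regression predictions: take the arithmetic mean."""
    return mean(values)


rng = random.Random(42)
data = list(range(10))  # stand-in for 10 training rows

# Each base model would be trained on its own bootstrap sample.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"Bootstrap sample {i}: {sorted(sample)}")

# Aggregating hypothetical predictions from three base models.
print("Classification:", majority_vote(["cat", "cat", "dog"]))
print("Regression:    ", average([200.0, 220.0, 210.0]))
```

Each printed sample has the same size as the original data but usually contains repeated rows and misses others, which is what makes the trained models differ from one another.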

Implementing Bagging with Scikit-learn

Python: Bagging Classifier
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# === DATASET ===
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === BAGGING CLASSIFIER ===
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=100,       # Number of base models
    max_samples=0.8,        # Each model draws 80% of the training rows
    max_features=1.0,       # Each model sees 100% of the features
    bootstrap=True,         # Sample rows with replacement
    bootstrap_features=False,
    random_state=42,
    n_jobs=-1
)

bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)

# === COMPARISON: Single Tree vs Bagging ===
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_tree.fit(X_train, y_train)
y_pred_tree = single_tree.predict(X_test)

print("=== COMPARISON ===")
print(f"Single Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree):.4f}")
print(f"Bagging (100 Trees) Accuracy:  {accuracy_score(y_test, y_pred_bag):.4f}")
print(f"Improvement: {(accuracy_score(y_test, y_pred_bag) - accuracy_score(y_test, y_pred_tree))*100:.2f} percentage points")

# OOB (out-of-bag) score: evaluation without a separate validation set
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=100, max_samples=0.8,
    bootstrap=True, oob_score=True,  # Enable OOB scoring
    random_state=42, n_jobs=-1
)
bagging_oob.fit(X_train, y_train)
print(f"\nOOB Score: {bagging_oob.oob_score_:.4f}")
print(f"Test Score: {bagging_oob.score(X_test, y_test):.4f}")

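Out-of-Bag (OOB) Evaluation

📦 What Is the OOB Score?

When a bootstrap sample of size n is drawn from n rows, about 37% of the rows are left out (out-of-bag) for each model. These OOB rows act as a free validation set, with no separate data split required. The OOB score is obtained by predicting each training row using only the models that did not see that row during training, then scoring those aggregated predictions. With max_samples=0.8, as in the code above, each model draws fewer rows, so its out-of-bag fraction is larger (roughly e^(-0.8) ≈ 0.45).

OOB fraction ≈ (1 - 1/n)^n ≈ 1/e ≈ 0.368 for large n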
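
The 1/e limit is easy to verify. The sketch below compares the closed-form expression with a simulated bootstrap draw; it assumes a bootstrap sample of the same size as the dataset, and both function names are illustrative.

```python
import math
import random


def oob_fraction_exact(n):
    """Chance that one given row is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n


def oob_fraction_simulated(n, rng):
    """Share of rows missing from a single simulated bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


rng = random.Random(0)
for n in (10, 100, 1_000, 10_000):
    exact = oob_fraction_exact(n)
    simulated = oob_fraction_simulated(n, rng)
    print(f"n={n:6d}  exact={exact:.4f}  simulated={simulated:.4f}")

print(f"1/e = {math.exp(-1):.4f}")
```

The exact value approaches 1/e from below as n grows, and the simulated fraction fluctuates around it.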

3. Random Forest

Random Forest extends bagging by adding feature randomization. In addition to bootstrap sampling of the rows, Random Forest considers only a random subset of the features at each split node. This makes the individual trees more diverse, which lowers the correlation between them.

Random Forest vs Bagging

Diagram: Random Forest vs Bagging
  BAGGING (Standard)                RANDOM FOREST
  ┌───────────────────┐             ┌───────────────────┐
  │ Each split may    │             │ Each split may    │
  │ consider ALL      │             │ consider ONLY A   │
  │ features          │             │ random SUBSET     │
  │                   │             │ of features       │
  │ Split 1: [f1,f2,  │             │ Split 1: [f1,f3,  │
  │   f3,f4,f5]       │             │   f5] (random 3)  │
  │ Split 2: [f1,f2,  │             │ Split 2: [f2,f4,  │
  │   f3,f4,f5]       │             │   f5] (random 3)  │
  └───────────────────┘             └───────────────────┘

  → Trees tend to be SIMILAR        → Trees are more DIVERSE
    (high correlation)                (low correlation)
  → Variance reduction              → BETTER variance reduction
    is less effective
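
The per-split feature sampling can be sketched as follows. This illustrates only the selection of candidate features, not tree building; the function candidate_features is hypothetical, and the rule of taking the integer part of sqrt(n_features) is an assumption chosen to mirror the common 'sqrt' setting.

```python
import math
import random


def candidate_features(n_features, rng, max_features="sqrt"):
    """Return the feature indices that a single split is allowed to consider."""
    if max_features == "sqrt":
        k = max(1, int(math.sqrt(n_features)))  # Random Forest style
    else:
        k = n_features                          # plain bagging: all features
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)
n_features = 20

for split in range(1, 4):
    bagging = candidate_features(n_features, rng, max_features=None)
    forest = candidate_features(n_features, rng, max_features="sqrt")
    print(f"Split {split}: bagging sees {len(bagging)} features, "
          f"random forest sees {forest}")
```

Because every split draws a fresh subset, a single dominant feature cannot be chosen at the top of every tree, which is what reduces the correlation between trees.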

Random Forest Implementation

Python: Random Forest Classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# === RANDOM FOREST CLASSIFIER ===
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=12,
    n_redundant=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf_clf = RandomForestClassifier(
    n_estimators=200,       # 200 decision trees
    max_depth=15,           # Maximum depth of each tree
    min_samples_split=5,    # Minimum samples required to split a node
    min_samples_leaf=2,     # Minimum samples required at a leaf node
    max_features='sqrt',    # sqrt(n_features) candidates per split (classification default)
    bootstrap=True,         # Bootstrap sampling
    oob_score=True,         # Out-of-bag evaluation
    random_state=42,
    n_jobs=-1               # Parallel processing
)

rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)

print("=== Random Forest Classifier ===")
print(f"Accuracy (Train): {rf_clf.score(X_train, y_train):.4f}")
print(f"Accuracy (Test):  {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score:        {rf_clf.oob_score_:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

# === FEATURE IMPORTANCE ===
feature_importance = rf_clf.feature_importances_
feature_names = [f'Feature_{i}' for i in range(X.shape[1])]

# Sort by importance
sorted_idx = np.argsort(feature_importance)[::-1]

plt.figure(figsize=(12, 6))
plt.bar(range(15), feature_importance[sorted_idx[:15]], color='steelblue')
plt.xticks(range(15), [feature_names[i] for i in sorted_idx[:15]], rotation=45)
plt.title('Top 15 Feature Importances: Random Forest')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()

# Print feature importance
print("\nFeature Importance (Top 10):")
for i in range(10):
    idx = sorted_idx[i]
    print(f"  {feature_names[idx]:15s}: {feature_importance[idx]:.4f}")

# === CROSS VALIDATION ===
cv_scores = cross_val_score(rf_clf, X, y, cv=5, scoring='accuracy', n_jobs=-1)
print(f"\nCross Validation (5-fold):")
print(f"  Scores: {cv_scores.round(4)}")
print(f"  Mean: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Effect of the Number of Trees on Accuracy

Python: Analyzing the Number of Trees
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

n_trees_range = [1, 5, 10, 25, 50, 100, 200, 300, 500]
train_scores = []
test_scores = []

for n_trees in n_trees_range:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=10,
                                 random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    train_scores.append(rf.score(X_train, y_train))
    test_scores.append(rf.score(X_test, y_test))
    print(f"Trees={n_trees:3d} → Train: {train_scores[-1]:.4f}, Test: {test_scores[-1]:.4f}")

plt.figure(figsize=(10, 6))
plt.plot(n_trees_range, train_scores, 'b-o', label='Training Score')
plt.plot(n_trees_range, test_scores, 'r-o', label='Test Score')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTakeaway: beyond roughly 100 trees, extra trees usually bring diminishing returns")

4. Boosting: AdaBoost, Gradient Boosting

Boosting is an ensemble technique that trains models sequentially. Each new model focuses on correcting the mistakes of the models before it. Unlike bagging, where models are trained in parallel and independently, boosting builds an increasingly strong model step by step.

Bagging vs Boosting

Diagram: Bagging vs Boosting
  BAGGING (Parallel)                  BOOSTING (Sequential)

  ┌──────┐ ┌──────┐ ┌──────┐          ┌──────┐
  │Model1│ │Model2│ │Model3│          │Model1│
  └──┬───┘ └──┬───┘ └──┬───┘          └──┬───┘
     │        │        │                 │
     ▼        ▼        ▼                 ▼ Error₁
   Pred1    Pred2    Pred3            ┌──────┐
     │        │        │              │Model2│ ← Focuses on
     └────────┼────────┘              └──┬───┘   Model1's errors
              ▼                          │
         Aggregation                  ┌──────┐
      (vote / average)                │Model3│ ← Focuses on
                                      └──┬───┘   Model2's errors
                                         │
                                         ▼
                                    Aggregation

  • All models side by side           • Models run in sequence
  • Reduces VARIANCE                  • Reduces BIAS
  • Models can train in parallel      • Models must train sequentially
  • Less prone to overfitting         • Can overfit (needs care)

AdaBoost (Adaptive Boosting)

AdaBoost (Freund & Schapire, 1997) was the first boosting algorithm to be widely successful in practice. Every data point carries a weight that is adapted at each round: misclassified points receive a larger weight, so the next model focuses more on those hard examples.
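
One round of this reweighting can be written out by hand. The sketch below follows the classic discrete AdaBoost update for a binary problem and assumes the weights sum to 1 and the weak learner's weighted error lies strictly between 0 and 0.5; adaboost_round is an illustrative name, and this is not how scikit-learn exposes the step.

```python
import math


def adaboost_round(weights, is_correct):
    """Run one discrete-AdaBoost reweighting step.

    weights    : current sample weights (assumed to sum to 1)
    is_correct : True where the weak learner classified the sample correctly
    Returns the learner's vote weight (alpha) and the new normalized weights.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Vote weight of this learner: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct samples, grow weights of mistakes.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5                             # five samples, equal weights
is_correct = [True, True, True, True, False]    # the last one is misclassified

alpha, new_weights = adaboost_round(weights, is_correct)
print(f"alpha = {alpha:.4f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

In this example the single misclassified sample ends up with weight 0.5 while the four correct ones share the other 0.5, so the next weak learner cannot ignore it.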

Python: AdaBoost Classifier
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# === DATASET ===
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10,
    n_redundant=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === ADABOOST ===
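adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Decision stump (depth-1 tree)
    n_estimators=200,        # 200 weak learners
    learning_rate=0.1,       # Smaller learning rate = more cautious updates
    # The `algorithm` argument is left out on purpose: recent scikit-learn
    # releases deprecate it and always use SAMME. On older releases, pass
    # algorithm='SAMME' explicitly to get the same behavior.
    random_state=42
)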

adaboost.fit(X_train, y_train)

# Accuracy as estimators are added (staged prediction)
train_scores = list(adaboost.staged_score(X_train, y_train))
test_scores = list(adaboost.staged_score(X_test, y_test))

print(f"AdaBoost Final Accuracy:")
print(f"  Train: {train_scores[-1]:.4f}")
print(f"  Test:  {test_scores[-1]:.4f}")

# Plot learning curve
plt.figure(figsize=(10, 6))
n_estimators = range(1, len(train_scores) + 1)
plt.plot(n_estimators, train_scores, 'b-', label='Training', alpha=0.7)
plt.plot(n_estimators, test_scores, 'r-', label='Testing', alpha=0.7)
plt.xlabel('Number of Estimators (Weak Learners)')
plt.ylabel('Accuracy')
plt.title('AdaBoost: Accuracy vs Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Compare learning rates
print("\n=== Effect of the Learning Rate ===")
for lr in [0.01, 0.05, 0.1, 0.5, 1.0]:
    ab = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=lr, random_state=42
    )
    ab.fit(X_train, y_train)
    print(f"  LR={lr:.2f} → Train: {ab.score(X_train, y_train):.4f}, "
          f"Test: {ab.score(X_test, y_test):.4f}")

Gradient Boosting

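Gradient Boosting (Friedman, 2001) is a more general form of boosting. Instead of reweighting the data points, it fits each new model to the residual errors of the current ensemble, which correspond to the negative gradient of the loss function (for squared error, simply the difference between the target and the current prediction).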
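
The residual-fitting loop can be sketched from scratch for regression with squared-error loss. This is a toy version under simplifying assumptions: a single input feature, one-split regression stumps as weak learners, and a fixed learning rate. The names fit_stump and gradient_boost are illustrative, not library functions.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides stay non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared-error residuals; returns a predict(x) function."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]              # a simple nonlinear target

mean_y = sum(ys) / len(ys)
baseline_mse = mse(ys, [mean_y] * len(ys))
model = gradient_boost(xs, ys)
boosted_mse = mse(ys, [model(x) for x in xs])
print(f"MSE of constant model: {baseline_mse:.4f}")
print(f"MSE after boosting:    {boosted_mse:.4f}")
```

Each round removes part of the remaining residual, so the training error falls as stages are added; the learning rate controls how much of each stump's correction is applied.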

Python: Gradient Boosting Classifier
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Dataset
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# === GRADIENT BOOSTING ===
gb_clf = GradientBoostingClassifier(
    n_estimators=200,       # Number of boosting stages
    learning_rate=0.1,      # Shrinkage rate
    max_depth=3,            # Depth of each tree
    min_samples_split=5,
    min_samples_leaf=2,
    subsample=0.8,          # Stochastic gradient boosting (80% of rows per tree)
    max_features='sqrt',
    random_state=42
)

gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)

print("=== Gradient Boosting ===")
print(f"Accuracy (Train): {gb_clf.score(X_train, y_train):.4f}")
print(f"Accuracy (Test):  {accuracy_score(y_test, y_pred):.4f}")
print(f"Number of trees: {gb_clf.n_estimators_}")

# Staged prediction: accuracy after each boosting stage.
# GradientBoostingClassifier offers staged_predict (not staged_score),
# so each stage is scored explicitly.
train_staged = [accuracy_score(y_train, p) for p in gb_clf.staged_predict(X_train)]
test_staged = [accuracy_score(y_test, p) for p in gb_clf.staged_predict(X_test)]

print(f"\nAccuracy at iteration 50:  Train={train_staged[49]:.4f}, Test={test_staged[49]:.4f}")
print(f"Accuracy at iteration 100: Train={train_staged[99]:.4f}, Test={test_staged[99]:.4f}")
print(f"Accuracy at iteration 200: Train={train_staged[199]:.4f}, Test={test_staged[199]:.4f}")

5. XGBoost: Extreme Gradient Boosting

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of gradient boosting created by Tianqi Chen and described in Chen & Guestrin (2016). It has been a staple of winning Kaggle solutions and is among the strongest algorithms for tabular data.

Why Is XGBoost So Popular?

| Feature | Description |
|---------|-------------|
| Regularization | Includes L1 (Lasso) and L2 (Ridge) penalties to help prevent overfitting |
| Missing-value handling | Handles missing data automatically by learning a default direction at each split |
| Parallel processing | Parallelizes the split-finding step |
| Tree pruning | Grows each tree up to max_depth, then prunes splits whose gain does not exceed gamma |
| Built-in cross-validation | Efficient internal CV |
| Early stopping | Stops automatically when validation performance stops improving |
| Custom objective | Lets you define your own loss function |
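
To make the regularization and pruning rows concrete, the sketch below evaluates the split gain given in the XGBoost paper (Chen & Guestrin, 2016), where G and H are the sums of first and second derivatives of the loss over the samples on each side of a split. It covers only the L2 penalty (reg_lambda) and the per-split cost (gamma); the L1 term is omitted for brevity, and the function names are illustrative rather than part of the xgboost API.

```python
def leaf_score(g, h, reg_lambda):
    """Quality of a leaf with gradient sum g and hessian sum h."""
    return g * g / (h + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into a left and a right child.

    A split is only worth keeping when the returned gain is positive.
    """
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    return 0.5 * (children - parent) - gamma


# A split that separates samples with opposite gradients.
print(split_gain(2.0, 1.0, -2.0, 1.0))                   # no complexity penalty
print(split_gain(2.0, 1.0, -2.0, 1.0, reg_lambda=10.0))  # stronger L2 shrinks the gain
print(split_gain(2.0, 1.0, -2.0, 1.0, gamma=2.5))        # gamma makes the gain negative
```

A larger reg_lambda shrinks every leaf score, and gamma subtracts a fixed cost per split, so both push the model toward fewer, more conservative splits.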

XGBoost Implementation

Python: XGBoost Classifier
# pip install xgboost
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, XGBRegressor, plot_importance, cv as xgb_cv
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# === DATASET ===
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=12,
    n_redundant=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# === XGBOOST CLASSIFIER ===
xgb_clf = XGBClassifier(
    n_estimators=300,           # Number of boosting rounds
    max_depth=6,                # Maximum tree depth
    learning_rate=0.1,          # Eta: shrinkage rate
    subsample=0.8,              # Fraction of rows per tree
    colsample_bytree=0.8,       # Fraction of features per tree
    reg_alpha=0.1,              # L1 regularization
    reg_lambda=1.0,             # L2 regularization
    min_child_weight=3,         # Minimum sum of instance weight in a child
    gamma=0.1,                  # Minimum loss reduction required to split
    random_state=42,
    eval_metric='logloss',      # Evaluation metric
    early_stopping_rounds=20,   # Stop if there is no improvement for 20 rounds
    n_jobs=-1
)

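# Train with early stopping. A validation split carved out of the training
# data drives early stopping, so the test set stays unseen until evaluation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
xgb_clf.fit(
    X_tr, y_tr,
    eval_set=[(X_tr, y_tr), (X_val, y_val)],
    verbose=50  # Print every 50 rounds
)

# Evaluation on the held-out test set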
y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]

print(f"\n=== XGBoost Results ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC-ROC:  {roc_auc_score(y_test, y_proba):.4f}")
print(f"Best iteration: {xgb_clf.best_iteration}")

# === FEATURE IMPORTANCE ===
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
plot_importance(xgb_clf, max_num_features=15, ax=ax1, importance_type='weight')
ax1.set_title('Feature Importance (Weight)')

plot_importance(xgb_clf, max_num_features=15, ax=ax2, importance_type='gain')
ax2.set_title('Feature Importance (Gain)')
plt.tight_layout()
plt.show()

# === LEARNING CURVE (from the eval sets) ===
results = xgb_clf.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(epochs)

plt.figure(figsize=(10, 6))
plt.plot(x_axis, results['validation_0']['logloss'], 'b-', label='Train')
plt.plot(x_axis, results['validation_1']['logloss'], 'r-', label='Validation')
plt.axvline(x=xgb_clf.best_iteration, color='green', linestyle='--',
            label=f'Best Iteration ({xgb_clf.best_iteration})')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('XGBoost: Training vs Validation Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

XGBoost Built-in Cross-Validation

Python: XGBoost CV
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Convert to DMatrix (XGBoost's internal data format)
dtrain = xgb.DMatrix(X, label=y)

# Parameters
params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'seed': 42
}

# Cross-validation
cv_results = xgb.cv(
    params=params,
    dtrain=dtrain,
    num_boost_round=500,
    nfold=5,                # 5-fold CV
    metrics=['logloss', 'auc'],
    early_stopping_rounds=30,
    verbose_eval=50,
    seed=42
)

print(f"\nBest rounds: {len(cv_results)}")
print(f"Best train-logloss: {cv_results['train-logloss-mean'].iloc[-1]:.4f}")
print(f"Best test-logloss:  {cv_results['test-logloss-mean'].iloc[-1]:.4f}")
print(f"Best test-AUC:      {cv_results['test-auc-mean'].iloc[-1]:.4f}")

💡 XGBoost vs LightGBM vs CatBoost
  • XGBoost: the most widely used and stable; works well on small to medium datasets
  • LightGBM (Microsoft): faster on large datasets; grows trees leaf-wise
  • CatBoost (Yandex): strong on categorical features, which it handles natively without manual encoding
  • All three are very capable; choose based on your specific needs

6. Voting & Stacking

Voting Classifier

Voting combines the predictions of several different models. Hard voting takes the majority class label, whereas soft voting averages the predicted class probabilities and picks the class with the highest average.
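
The two rules can disagree, as the small sketch below shows for a binary problem. It assumes each model outputs a probability for class 1 and that 0.5 is the decision threshold; hard_vote and soft_vote are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter


def hard_vote(probas, threshold=0.5):
    """Turn each probability into a 0/1 label, then take the majority label."""
    labels = [int(p >= threshold) for p in probas]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas, threshold=0.5):
    """Average the probabilities first, then apply the threshold once."""
    return int(sum(probas) / len(probas) >= threshold)


# Probability of class 1 from three hypothetical models:
# one is very confident in class 1, two lean weakly toward class 0.
probas = [0.9, 0.4, 0.4]

print("Hard voting predicts class", hard_vote(probas))
print("Soft voting predicts class", soft_vote(probas))
```

Hard voting counts two votes for class 0 against one for class 1, while soft voting averages to about 0.57 and picks class 1. Soft voting therefore rewards confident models, which helps only when the probabilities are reasonably well calibrated.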

Python: Voting & Stacking Classifier
import numpy as np
from sklearn.ensemble import (VotingClassifier, StackingClassifier,
                               RandomForestClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

# Dataset
X, y = make_classification(n_samples=1500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# === BASE MODELS ===
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
svc = SVC(probability=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)

# === HARD VOTING ===
voting_hard = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    voting='hard'
)

# === SOFT VOTING ===
voting_soft = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    voting='soft'
)

# === STACKING ===
stacking = StackingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    final_estimator=LogisticRegression(),  # Meta-learner
    cv=5
)

# === COMPARISON ===
models = {
    'Random Forest': rf,
    'Gradient Boosting': gb,
    'SVC': svc,
    'KNN': knn,
    'Hard Voting': voting_hard,
    'Soft Voting': voting_soft,
    'Stacking': stacking,
}

print("=== COMPARISON OF ENSEMBLE METHODS ===")
print(f"{'Model':<25} {'Train':<10} {'Test':<10}")
print("=" * 45)

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:<25} {train_acc:<10.4f} {test_acc:<10.4f}")

7. Comparison of All Methods

| Aspect | Bagging/RF | AdaBoost | Gradient Boosting | XGBoost |
|--------|------------|----------|-------------------|---------|
| Training | Parallel | Sequential | Sequential | Sequential (parallel split finding) |
| Mainly reduces | Variance | Bias | Bias | Bias + variance |
| Overfitting risk | Low | Medium | Medium to high | Low (regularization) |
| Missing-value handling | No | No | No | Yes |
| Speed | Fast | Medium | Slow | Fast |
| Best use case | Strong baseline, large data | Clean data, simple models | Competitions, tabular data | Kaggle competitions, production |

8. Hyperparameter Tuning for Ensembles

Python: RandomizedSearchCV for XGBoost
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Parameter values to sample from
param_dist = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.5, 1, 2, 5],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0, 0.1, 0.2, 0.3],
}

xgb_model = XGBClassifier(
    random_state=42, eval_metric='logloss',
    n_jobs=1            # Let the search parallelize across fits instead
)

search = RandomizedSearchCV(
    xgb_model, param_distributions=param_dist,
    n_iter=50,          # Try 50 random combinations
    cv=5,               # 5-fold CV
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

search.fit(X, y)

print(f"Best Parameters: {search.best_params_}")
print(f"Best CV Score: {search.best_score_:.4f}")

9. Comprehension Quiz

🧠 Quiz: Ensemble Methods

1. What is the main difference between bagging and boosting?

2. What makes Random Forest different from plain bagging?

3. In AdaBoost, what happens to misclassified data points?

4. What advantages does XGBoost have over standard gradient boosting?

5. In stacking, what is the role of the meta-learner?

🎯 Article Summary
  • Ensemble methods combine many models to achieve better performance
  • Bagging (parallel) reduces variance; Random Forest is its best-known implementation
  • Boosting (sequential) reduces bias: AdaBoost, Gradient Boosting, XGBoost
  • XGBoost is a leading choice for tabular data: regularization, missing-value handling, speed
  • Voting combines the predictions of different models; stacking uses a meta-learner
  • On tabular datasets, tree-based ensembles often outperform deep learning