1. Introduction to Ensemble Methods
Ensemble methods are machine learning techniques that combine several models (called base learners or weak learners) into a single model that is stronger and more accurate. The underlying principle is the "wisdom of the crowd": the collective decision of many models tends to be better than that of any single model.
A simple analogy: if you ask 100 people to estimate the price of a house, the average of their answers is likely to be closer to the truth than any one individual's guess, even though each person may be wrong.
Why Do Ensembles Work?
WHY DO ENSEMBLES WORK?

Model 1 (Decision Tree):        Prediction: Cat  (correct)
Model 2 (KNN):                  Prediction: Cat  (correct)
Model 3 (Logistic Regression):  Prediction: Dog  (wrong)
Model 4 (SVM):                  Prediction: Cat  (correct)
Model 5 (Naive Bayes):          Prediction: Dog  (wrong)
Majority vote:                  Prediction: Cat  (3 vs 2)

Principle: if every model is better than chance (accuracy > 50%) and their errors are at least partly independent, the combination tends to be more accurate than any single model.
Requirement: the models must be DIVERSE (different from one another). If all models make the SAME errors, the ensemble does NOT help.
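The principle above can be checked with a small calculation. The sketch below assumes an idealized setting that the text only hints at: every model has the same accuracy `p` and the models err independently of each other. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes every model is correct with probability p, independently of the others.
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )


for n in (1, 5, 25, 101):
    print(f"{n:3d} models at 60% accuracy -> majority correct with p = "
          f"{majority_vote_accuracy(n, 0.6):.4f}")
```

With five models at 60% accuracy the majority is right about 68% of the time, and the figure keeps rising as models are added. If `p` drops below 0.5 the effect reverses, and with correlated errors the gain shrinks, which is why diversity matters.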
Main Categories of Ensemble Methods
| Category | Principle | Example Algorithms | Analogy |
|---|---|---|---|
| Bagging | Parallel training on different bootstrap samples of the data | Random Forest | An exam with randomly drawn questions, with the answers averaged |
| Boosting | Sequential training; each model corrects the errors of the previous one | AdaBoost, XGBoost | A student learning from mistakes on the previous exam |
| Stacking | Uses the base models' outputs as inputs to a meta-model | Stacked Generalization | A teacher reviewing every student's answer before giving the final one |
| Voting | Combines predictions from different kinds of models | Voting Classifier | A democratic vote: the majority wins |
2. Bagging (Bootstrap Aggregating)
Bagging (Bootstrap Aggregating) is an ensemble technique introduced by Leo Breiman (1996). It reduces a model's variance by training many models on bootstrap samples of the data (sampling with replacement) and aggregating their predictions.
How Bagging Works
BAGGING PROCESS

Original dataset (N samples)
  |-- Bootstrap sample 1 --> Train model (Tree 1) --> Prediction 1 --|
  |-- Bootstrap sample 2 --> Train model (Tree 2) --> Prediction 2 --|--> Aggregation (voting / averaging) --> Final prediction
  |-- Bootstrap sample 3 --> Train model (Tree 3) --> Prediction 3 --|

Bootstrap: sampling WITH replacement, so each sample can contain duplicate rows.
Aggregation: classification uses majority voting; regression uses the mean.
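The process above can be sketched by hand in a few lines. This is a minimal illustration, not how scikit-learn implements bagging internally; the dataset sizes, the number of models and all variable names are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25            # odd, so a binary majority vote cannot tie
n_rows = len(X_train)
all_preds = []

for _ in range(n_models):
    # Bootstrap: draw n_rows indices WITH replacement (duplicates allowed)
    idx = rng.integers(0, n_rows, size=n_rows)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)            # shape: (n_models, n_test)

# Aggregation: majority vote over the 0/1 labels
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_test).mean()
print(f"Hand-rolled bagging accuracy: {bagged_acc:.4f}")
```

For regression the same loop applies, with the vote replaced by `all_preds.mean(axis=0)`.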
Bagging Implementation with scikit-learn
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# === DATASET ===
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# === BAGGING CLASSIFIER ===
# Note: `estimator=` needs scikit-learn >= 1.2 (older releases call it `base_estimator`)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=100,          # Number of base models
    max_samples=0.8,           # Each bootstrap sample draws 0.8 * N rows
    max_features=1.0,          # Every model sees 100% of the features
    bootstrap=True,            # Sample rows with replacement
    bootstrap_features=False,  # Do not resample features
    random_state=42,
    n_jobs=-1
)
bagging.fit(X_train, y_train)
y_pred_bag = bagging.predict(X_test)
# === COMPARISON: single tree vs bagging ===
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_tree.fit(X_train, y_train)
y_pred_tree = single_tree.predict(X_test)
acc_tree = accuracy_score(y_test, y_pred_tree)
acc_bag = accuracy_score(y_test, y_pred_bag)
print("=== COMPARISON ===")
print(f"Single Decision Tree Accuracy: {acc_tree:.4f}")
print(f"Bagging (100 Trees) Accuracy:  {acc_bag:.4f}")
print(f"Improvement: {(acc_bag - acc_tree) * 100:.2f} percentage points")
# OOB (out-of-bag) score: evaluation without a separate validation set
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=100, max_samples=0.8,
    bootstrap=True, oob_score=True,  # Enable OOB scoring
    random_state=42, n_jobs=-1
)
bagging_oob.fit(X_train, y_train)
print(f"\nOOB Score:  {bagging_oob.oob_score_:.4f}")
print(f"Test Score: {bagging_oob.score(X_test, y_test):.4f}")
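Out-of-Bag (OOB) Evaluation
When a bootstrap sample of size n is drawn from n rows, about 37% of the rows are never picked for a given model; these are that model's out-of-bag (OOB) rows. Each row can therefore be predicted using only the models that did not see it during training, which gives a validation estimate at no extra cost, without a separate hold-out split. The OOB score is the accuracy of these aggregated OOB predictions. With max_samples=0.8, as in the example above, each model draws fewer rows, so the OOB share rises to roughly 45%.
OOB fraction $= (1 - 1/n)^n \to 1/e \approx 0.368$ as $n \to \infty$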
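The limit can be verified numerically. The sketch below compares the analytic value with one simulated bootstrap draw; the sample size `n = 10_000` and the variable names are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Analytic probability that a given row is never drawn in n draws
analytic = (1 - 1 / n) ** n

# Simulation: draw n indices with replacement, count rows that never appear
idx = rng.integers(0, n, size=n)
oob_fraction = 1 - len(np.unique(idx)) / n

# Same calculation when only 0.8 * n rows are drawn (max_samples=0.8)
analytic_08 = (1 - 1 / n) ** int(0.8 * n)

print(f"Analytic OOB fraction (n draws):     {analytic:.4f}")
print(f"Simulated OOB fraction (n draws):    {oob_fraction:.4f}")
print(f"Analytic OOB fraction (0.8 n draws): {analytic_08:.4f}")
```

The third figure, roughly $e^{-0.8}$, shows why a smaller `max_samples` leaves more rows out of bag for each model.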
3. Random Forest
Random Forest extends Bagging by adding feature randomization. Besides bootstrap sampling of the rows, it also picks a random subset of the features as split candidates at every node. This makes the individual trees more diverse.
Random Forest vs Bagging
| | Bagging (standard) | Random Forest |
|---|---|---|
| Features each tree may use at a split | ALL features | Only a random SUBSET of the features |
| Split 1 candidates | [f1, f2, f3, f4, f5] | [f1, f3, f5] (3 chosen at random) |
| Split 2 candidates | [f1, f2, f3, f4, f5] | [f2, f4, f5] (3 chosen at random) |
| Resulting trees | Tend to be SIMILAR (high correlation) | More DIVERSE (low correlation) |
| Variance reduction | Less effective | BETTER |
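One way to probe the correlation claim is to measure how strongly the individual trees agree. The sketch below is a rough experiment under assumptions chosen for this illustration: `max_features=None` in `RandomForestClassifier` is used as a stand-in for plain bagging of trees, and the helper `mean_tree_correlation` is a made-up name. The outcome depends on the data, so the numbers should be read as indicative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def mean_tree_correlation(forest, X_eval):
    """Average pairwise correlation between the individual trees' predictions."""
    preds = np.array([tree.predict(X_eval) for tree in forest.estimators_])
    corr = np.corrcoef(preds)                       # (n_trees, n_trees)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return float(off_diagonal.mean())


results = {}
for label, max_features in [("all features (bagging-like)", None),
                            ("sqrt features (random forest)", "sqrt")]:
    forest = RandomForestClassifier(n_estimators=30, max_features=max_features,
                                    random_state=0)
    forest.fit(X_train, y_train)
    results[label] = (mean_tree_correlation(forest, X_test),
                      forest.score(X_test, y_test))
    print(f"{label:32s} mean tree correlation = {results[label][0]:.3f}, "
          f"test accuracy = {results[label][1]:.3f}")
```

A lower mean correlation for the `sqrt` setting would match the table above; the accuracy difference may be small or even reversed on some datasets.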
Random Forest Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification, fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
# === RANDOM FOREST CLASSIFIER ===
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=12,
    n_redundant=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf_clf = RandomForestClassifier(
    n_estimators=200,      # 200 decision trees
    max_depth=15,          # Maximum depth of each tree
    min_samples_split=5,   # Minimum samples required to split a node
    min_samples_leaf=2,    # Minimum samples in a leaf node
    max_features='sqrt',   # sqrt(n_features) candidates per split (classification default)
    bootstrap=True,        # Bootstrap sampling
    oob_score=True,        # Out-of-bag evaluation
    random_state=42,
    n_jobs=-1              # Use all CPU cores
)
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)
print("=== Random Forest Classifier ===")
print(f"Accuracy (Train): {rf_clf.score(X_train, y_train):.4f}")
print(f"Accuracy (Test):  {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score:        {rf_clf.oob_score_:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# === FEATURE IMPORTANCE ===
feature_importance = rf_clf.feature_importances_
feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
# Sort features from most to least important
sorted_idx = np.argsort(feature_importance)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(15), feature_importance[sorted_idx[:15]], color='steelblue')
plt.xticks(range(15), [feature_names[i] for i in sorted_idx[:15]], rotation=45)
plt.title('Top 15 Feature Importance: Random Forest')
plt.ylabel('Importance')
plt.tight_layout()
plt.show()
# Print the ten most important features
print("\nFeature Importance (Top 10):")
for i in range(10):
    idx = sorted_idx[i]
    print(f"  {feature_names[idx]:15s}: {feature_importance[idx]:.4f}")
# === CROSS-VALIDATION ===
cv_scores = cross_val_score(rf_clf, X, y, cv=5, scoring='accuracy', n_jobs=-1)
print("\nCross-Validation (5-fold):")
print(f"  Scores: {cv_scores.round(4)}")
print(f"  Mean:   {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
Effect of the Number of Trees on Accuracy
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
n_trees_range = [1, 5, 10, 25, 50, 100, 200, 300, 500]
train_scores = []
test_scores = []
for n_trees in n_trees_range:
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=10,
                                random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    train_scores.append(rf.score(X_train, y_train))
    test_scores.append(rf.score(X_test, y_test))
    print(f"Trees={n_trees:3d} -> Train: {train_scores[-1]:.4f}, Test: {test_scores[-1]:.4f}")
plt.figure(figsize=(10, 6))
plt.plot(n_trees_range, train_scores, 'b-o', label='Training Score')
plt.plot(n_trees_range, test_scores, 'r-o', label='Test Score')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Random Forest: Accuracy vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare this message with the numbers printed above; the exact plateau depends on the data
print("\nTypical outcome: beyond roughly 100 trees, adding more trees gives diminishing returns")
4. Boosting: AdaBoost, Gradient Boosting
Boosting is an ensemble technique that trains models sequentially. Each new model concentrates on correcting the errors of the models before it. Unlike Bagging, whose models are trained in parallel and independently, Boosting builds up a progressively stronger model step by step.
Bagging vs Boosting
| | Bagging (parallel) | Boosting (sequential) |
|---|---|---|
| Flow | Model 1, Model 2 and Model 3 are trained side by side; Pred 1, Pred 2 and Pred 3 are aggregated (vote / average) | Model 1 → Model 2 (focuses on Model 1's errors) → Model 3 (focuses on Model 2's errors) → aggregation |
| Layout | All models side by side | Models in sequence |
| Main effect | Reduces VARIANCE | Reduces BIAS |
| Training | Models can be trained in parallel | Models must be trained sequentially |
| Overfitting | Does not overfit easily | Can overfit (needs care) |
AdaBoost (Adaptive Boosting)
AdaBoost (Freund & Schapire, 1997) was the first widely successful boosting algorithm. Every data point carries a weight that is adapted each round: misclassified points receive a larger weight, so the next model focuses more on these hard examples.
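A single reweighting round can be worked through by hand. The sketch below uses the classic binary (discrete) AdaBoost update with labels in {-1, +1}; the five-point toy data is invented for this illustration, and scikit-learn's SAMME variant uses a closely related but not identical formula.

```python
import numpy as np

# Toy round: five points, the weak learner gets the last one wrong
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
weights = np.full(len(y_true), 1 / len(y_true))   # start uniform: 0.2 each

# Weighted error of the weak learner
misclassified = y_true != y_pred
err = weights[misclassified].sum() / weights.sum()

# Learner's say in the final vote: larger when the error is smaller
alpha = 0.5 * np.log((1 - err) / err)

# Up-weight mistakes (y * h = -1), down-weight correct points (y * h = +1)
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()                  # renormalize to sum to 1

print(f"weighted error = {err:.3f}, alpha = {alpha:.4f}")
print("new weights    =", new_weights.round(4))
```

After this round the single misclassified point carries half of the total weight (0.5), while each correctly classified point drops to 0.125, so the next weak learner is pushed to get that point right.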
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# === DATASET ===
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=10,
    n_redundant=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# === ADABOOST ===
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Decision stump (depth-1 tree)
    n_estimators=200,     # 200 weak learners
    learning_rate=0.1,    # Smaller learning rate = more cautious updates
    algorithm='SAMME',    # Discrete AdaBoost; deprecated in recent scikit-learn, drop it if a warning appears
    random_state=42
)
adaboost.fit(X_train, y_train)
# Accuracy as estimators are added (staged prediction)
train_scores = list(adaboost.staged_score(X_train, y_train))
test_scores = list(adaboost.staged_score(X_test, y_test))
print("AdaBoost Final Accuracy:")
print(f"  Train: {train_scores[-1]:.4f}")
print(f"  Test:  {test_scores[-1]:.4f}")
# Plot the learning curve
plt.figure(figsize=(10, 6))
n_estimators = range(1, len(train_scores) + 1)
plt.plot(n_estimators, train_scores, 'b-', label='Training', alpha=0.7)
plt.plot(n_estimators, test_scores, 'r-', label='Testing', alpha=0.7)
plt.xlabel('Number of Estimators (Weak Learners)')
plt.ylabel('Accuracy')
plt.title('AdaBoost: Accuracy vs Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Compare learning rates
print("\n=== Effect of the Learning Rate ===")
for lr in [0.01, 0.05, 0.1, 0.5, 1.0]:
    ab = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=lr, random_state=42
    )
    ab.fit(X_train, y_train)
    print(f"  LR={lr:.2f} -> Train: {ab.score(X_train, y_train):.4f}, "
          f"Test: {ab.score(X_test, y_test):.4f}")
Gradient Boosting
Gradient Boosting (Friedman, 2001) is a more general form of boosting. Instead of reweighting the data, it fits each new model to the residual errors of the current ensemble, or more precisely to the negative gradient of the loss function with respect to the current predictions (for squared error this is exactly the residual).
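The residual-fitting idea is easiest to see in a hand-written loop for regression with squared error. This is a minimal sketch rather than the scikit-learn implementation; the noisy sine data, the shallow `DecisionTreeRegressor` base learner and all hyperparameter values are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_stages = 100

# Stage 0: a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_stages):
    # Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Shrink each tree's contribution by the learning rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_stages} stages: {final_mse:.4f}")
```

To predict on new data, start from `y.mean()` and add `learning_rate * tree.predict(X_new)` for every stored tree. For classification the residual is replaced by the gradient of the log loss, but the loop has the same shape.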
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
# Dataset
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# === GRADIENT BOOSTING ===
gb_clf = GradientBoostingClassifier(
    n_estimators=200,      # Number of boosting stages
    learning_rate=0.1,     # Shrinkage rate
    max_depth=3,           # Depth of each tree
    min_samples_split=5,
    min_samples_leaf=2,
    subsample=0.8,         # Stochastic gradient boosting (80% of rows per tree)
    max_features='sqrt',
    random_state=42
)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)
print("=== Gradient Boosting ===")
print(f"Accuracy (Train): {gb_clf.score(X_train, y_train):.4f}")
print(f"Accuracy (Test):  {accuracy_score(y_test, y_pred):.4f}")
print(f"Number of Trees:  {gb_clf.n_estimators_}")
# Staged prediction: accuracy after each boosting stage
train_staged = list(gb_clf.staged_score(X_train, y_train))
test_staged = list(gb_clf.staged_score(X_test, y_test))
print(f"\nAccuracy at iteration 50:  Train={train_staged[49]:.4f}, Test={test_staged[49]:.4f}")
print(f"Accuracy at iteration 100: Train={train_staged[99]:.4f}, Test={test_staged[99]:.4f}")
print(f"Accuracy at iteration 200: Train={train_staged[199]:.4f}, Test={test_staged[199]:.4f}")
5. XGBoost: Extreme Gradient Boosting
XGBoost (eXtreme Gradient Boosting) is an optimized implementation of gradient boosting developed by Tianqi Chen (2016). It has become a staple of winning Kaggle solutions and is one of the strongest algorithms for tabular data.
Why Is XGBoost So Popular?
| Feature | Explanation |
|---|---|
| Regularization | Includes L1 (Lasso) and L2 (Ridge) regularization to reduce overfitting |
| Handling Missing Values | Learns a default direction for missing values at each split, so missing data is handled automatically |
| Parallel Processing | Parallelizes split finding within each tree (the trees themselves are still built sequentially) |
| Tree Pruning | Grows each tree up to max_depth, then prunes splits whose loss reduction falls below gamma (min_split_loss) |
| Built-in Cross-Validation | Efficient internal CV through xgb.cv |
| Early Stopping | Stops automatically when performance on the evaluation set stops improving |
| Custom Objective | Lets you define your own loss function |
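How the L2 penalty and `gamma` act on a single split can be shown with the leaf-weight and split-gain formulas from the XGBoost paper. The sketch below is plain Python, not a call into the library; `G` and `H` denote the summed first and second derivatives of the loss over the rows in a node (notation assumed from the paper), the L1 term is left out for brevity, and the numbers are invented.

```python
def leaf_weight(G: float, H: float, reg_lambda: float) -> float:
    """Optimal leaf value for gradient sum G and hessian sum H (L2 penalty only)."""
    return -G / (H + reg_lambda)


def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               reg_lambda: float, gamma: float) -> float:
    """Loss reduction from splitting a node into left/right children, minus gamma."""
    def score(G: float, H: float) -> float:
        return G * G / (H + reg_lambda)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


# Invented gradient statistics for one candidate split
stats = dict(G_L=-4.0, H_L=2.0, G_R=3.0, H_R=2.0)

gain_light = split_gain(**stats, reg_lambda=1.0, gamma=0.0)
gain_heavy = split_gain(**stats, reg_lambda=10.0, gamma=0.0)
gain_pruned = split_gain(**stats, reg_lambda=1.0, gamma=5.0)

print(f"gain with lambda=1,  gamma=0: {gain_light:.4f}")
print(f"gain with lambda=10, gamma=0: {gain_heavy:.4f}")
print(f"gain with lambda=1,  gamma=5: {gain_pruned:.4f}  (negative: split not worth keeping)")
print(f"left leaf weight with lambda=1: {leaf_weight(-4.0, 2.0, 1.0):.4f}")
```

A larger `reg_lambda` shrinks both the leaf values and the gain (about 4.07 versus 1.01 here), and a split is only kept when its gain stays positive after subtracting `gamma`.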
XGBoost Implementation
# pip install xgboost
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, XGBRegressor, plot_importance, cv as xgb_cv
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
# === DATASET ===
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=12,
    n_redundant=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# === XGBOOST CLASSIFIER ===
xgb_clf = XGBClassifier(
    n_estimators=300,          # Number of boosting rounds
    max_depth=6,               # Maximum tree depth
    learning_rate=0.1,         # Eta, the shrinkage rate
    subsample=0.8,             # Fraction of rows per tree
    colsample_bytree=0.8,      # Fraction of features per tree
    reg_alpha=0.1,             # L1 regularization
    reg_lambda=1.0,            # L2 regularization
    min_child_weight=3,        # Minimum sum of instance weight in a child
    gamma=0.1,                 # Minimum loss reduction required to split
    random_state=42,
    eval_metric='logloss',     # Evaluation metric
    early_stopping_rounds=20,  # Stop if there is no improvement for 20 rounds
    n_jobs=-1
)
# Training with early stopping.
# Caution: the last eval_set entry drives early stopping. Using the test set here
# keeps the example short but makes the test metrics optimistic; in practice,
# carve out a separate validation split for this purpose.
xgb_clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=50                 # Print every 50 rounds
)
# Evaluation
y_pred = xgb_clf.predict(X_test)
y_proba = xgb_clf.predict_proba(X_test)[:, 1]
print("\n=== XGBoost Results ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"AUC-ROC:  {roc_auc_score(y_test, y_proba):.4f}")
print(f"Best iteration: {xgb_clf.best_iteration}")
# === FEATURE IMPORTANCE ===
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
plot_importance(xgb_clf, max_num_features=15, ax=ax1, importance_type='weight')
ax1.set_title('Feature Importance (Weight)')
plot_importance(xgb_clf, max_num_features=15, ax=ax2, importance_type='gain')
ax2.set_title('Feature Importance (Gain)')
plt.tight_layout()
plt.show()
# === LEARNING CURVE (from the eval sets) ===
results = xgb_clf.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(epochs)
plt.figure(figsize=(10, 6))
plt.plot(x_axis, results['validation_0']['logloss'], 'b-', label='Train')
plt.plot(x_axis, results['validation_1']['logloss'], 'r-', label='Test')
plt.axvline(x=xgb_clf.best_iteration, color='green', linestyle='--',
            label=f'Best Iteration ({xgb_clf.best_iteration})')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('XGBoost: Training vs Validation Loss')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Built-in XGBoost Cross-Validation
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
# Convert to DMatrix (XGBoost's internal data format)
dtrain = xgb.DMatrix(X, label=y)
# Parameters
params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'seed': 42
}
# Cross-validation
cv_results = xgb.cv(
    params=params,
    dtrain=dtrain,
    num_boost_round=500,
    nfold=5,                    # 5-fold CV
    metrics=['logloss', 'auc'],
    early_stopping_rounds=30,
    verbose_eval=50,
    seed=42
)
# With early stopping, cv_results is truncated at the best round
print(f"\nBest rounds: {len(cv_results)}")
print(f"Best train-logloss: {cv_results['train-logloss-mean'].iloc[-1]:.4f}")
print(f"Best test-logloss:  {cv_results['test-logloss-mean'].iloc[-1]:.4f}")
print(f"Best test-AUC:      {cv_results['test-auc-mean'].iloc[-1]:.4f}")
- XGBoost: the most popular and stable; works well on small to medium datasets
- LightGBM (Microsoft): faster on large datasets, uses leaf-wise tree growth
- CatBoost (Yandex): strongest on categorical features, which it handles without manual encoding
- All three are very capable; choose based on your specific needs
6. Voting & Stacking
Voting Classifier
Voting combines the predictions of several different models. Hard voting takes the majority class label, whereas soft voting averages the predicted class probabilities and picks the class with the highest mean.
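Hard and soft voting can disagree on the same inputs. The toy sketch below uses three invented probability estimates for the positive class of a single sample to show how; it does not call scikit-learn.

```python
import numpy as np

# Invented P(class = 1) from three models for one sample
proba_class1 = np.array([0.45, 0.45, 0.90])

# Hard voting: each model casts one vote for its most likely class
hard_votes = (proba_class1 > 0.5).astype(int)            # [0, 0, 1]
hard_pred = int(hard_votes.sum() > len(hard_votes) / 2)  # majority label

# Soft voting: average the probabilities, then threshold
mean_proba = proba_class1.mean()                         # 0.60
soft_pred = int(mean_proba > 0.5)

print(f"hard votes = {hard_votes.tolist()} -> hard prediction = {hard_pred}")
print(f"mean probability = {mean_proba:.2f} -> soft prediction = {soft_pred}")
```

Two lukewarm votes for class 0 win under hard voting, while one confident vote for class 1 tips the average under soft voting. Soft voting therefore only makes sense when the base models produce reasonably calibrated probabilities.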
import numpy as np
from sklearn.ensemble import (VotingClassifier, StackingClassifier,
RandomForestClassifier, GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
# Dataset
X, y = make_classification(n_samples=1500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# === BASE MODELS ===
rf = RandomForestClassifier(n_estimators=100, random_state=42)
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
svc = SVC(probability=True, random_state=42)  # probability=True is required for soft voting
knn = KNeighborsClassifier(n_neighbors=5)
# === HARD VOTING ===
voting_hard = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    voting='hard'
)
# === SOFT VOTING ===
voting_soft = VotingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    voting='soft'
)
# === STACKING ===
stacking = StackingClassifier(
    estimators=[('rf', rf), ('gb', gb), ('svc', svc), ('knn', knn)],
    final_estimator=LogisticRegression(),  # Meta-learner
    cv=5
)
# === COMPARISON ===
models = {
    'Random Forest': rf,
    'Gradient Boosting': gb,
    'SVC': svc,
    'KNN': knn,
    'Hard Voting': voting_hard,
    'Soft Voting': voting_soft,
    'Stacking': stacking,
}
print("=== COMPARISON OF ENSEMBLE METHODS ===")
print(f"{'Model':<25} {'Train':<10} {'Test':<10}")
print("=" * 45)
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:<25} {train_acc:<10.4f} {test_acc:<10.4f}")
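What a stacking ensemble does with its `cv` argument can be sketched by hand. The version below is a simplified illustration, not scikit-learn's internal code: it uses only two base models, builds the meta-features from out-of-fold probabilities via `cross_val_predict`, and all dataset sizes and variable names are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Step 1: out-of-fold probabilities, so the meta-learner never sees
# predictions made on rows the base model was trained on
meta_train = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5,
                      method="predict_proba")[:, 1]
    for model in base_models
])

# Step 2: fit the meta-learner on the base models' outputs
meta_learner = LogisticRegression()
meta_learner.fit(meta_train, y_train)

# Step 3: refit each base model on all training rows, then stack test outputs
meta_test = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in base_models
])
stack_acc = meta_learner.score(meta_test, y_test)
print(f"Hand-rolled stacking accuracy: {stack_acc:.4f}")
```

Training the meta-learner on in-sample base predictions instead would let overfitted base models look more reliable than they are, which is the leak the out-of-fold step is meant to avoid.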
7. Comparison of All Methods
| Aspect | Bagging/RF | AdaBoost | Gradient Boosting | XGBoost |
|---|---|---|---|---|
| Training | Parallel | Sequential | Sequential | Sequential (parallel split finding) |
| Mainly reduces | Variance | Bias | Bias | Bias + variance |
| Overfitting risk | Low | Medium | Medium to high | Low (regularization) |
| Handles missing values | No (in most classic implementations) | No | No | Yes |
| Speed | Fast | Medium | Slow | Fast |
| Best use case | Strong baseline, large data | Clean data, simple models | Competitions, tabular data | Kaggle competitions, production |
8. Hyperparameter Tuning for Ensembles
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
# Candidate values to sample from
param_dist = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.5, 1, 2, 5],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0, 0.1, 0.2, 0.3],
}
xgb_model = XGBClassifier(
    random_state=42, eval_metric='logloss', n_jobs=-1
)
search = RandomizedSearchCV(
    xgb_model, param_distributions=param_dist,
    n_iter=50,            # Try 50 random combinations
    cv=5,                 # 5-fold CV
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=1
)
search.fit(X, y)
print(f"Best Parameters: {search.best_params_}")
print(f"Best CV Score: {search.best_score_:.4f}")
9. Comprehension Quiz
Quiz: Ensemble Methods
1. What is the main difference between Bagging and Boosting?
2. What makes Random Forest different from plain Bagging?
3. In AdaBoost, what happens to misclassified data points?
4. What are the advantages of XGBoost over standard Gradient Boosting?
5. In Stacking, what is the role of the meta-learner?
- Ensemble methods combine many models to achieve better performance
- Bagging (parallel) reduces variance; Random Forest is its best-known implementation
- Boosting (sequential) reduces bias: AdaBoost, Gradient Boosting, XGBoost
- XGBoost is a leading choice for tabular data: regularization, missing-value handling, speed
- Voting combines the predictions of different models; Stacking uses a meta-learner
- On tabular datasets, tree-based ensemble methods often outperform deep learning