1. Why Fine-tune an LLM?
Large Language Models (LLMs) such as Llama, Mistral, and GPT have already been pretrained on trillions of tokens. A general-purpose model, however, is not necessarily optimal for your specific task. Fine-tuning lets you adapt the model to a particular domain.
When Is Fine-tuning Needed?
| Scenario | Solution | Complexity |
|---|---|---|
| The model needs domain-specific knowledge | Fine-tuning | Medium |
| You want a particular output style (formal, concise) | Fine-tuning | Medium |
| You have proprietary data (company documents) | RAG or fine-tuning | Medium |
| You need high accuracy on a specific task | Fine-tuning | Medium-High |
| You only need additional context | RAG (no fine-tuning required) | Low |
| Your GPU budget is limited | QLoRA / LoRA | Medium |
LLM adaptation approaches

Approach 1: Prompt Engineering
System Prompt -> LLM (frozen) -> Output
- Pro: no training required
- Con: limited by the context window

Approach 2: RAG
Vector Store -> Retrieval Chain -> LLM (frozen) -> Output
- Pro: real-time data, no training required
- Con: does not change the model's style or capabilities

Approach 3: Fine-tuning (LoRA/QLoRA)
Task-specific dataset -> LLM + LoRA adapter (only the adapter is trained) -> Fine-tuned LLM
- Pro: changes the model's style and capabilities
- Pro: inexpensive with LoRA/QLoRA
- Con: requires a high-quality dataset
2. Full Fine-tuning vs Parameter-Efficient
Full fine-tuning updates every parameter in the model. For a 7B-parameter model, the weights alone occupy about 28 GB in FP32 or about 14 GB in FP16, and gradients and optimizer states add substantially more during training. That makes it very expensive.
Parameter-Efficient Fine-Tuning (PEFT) trains only a small fraction of the parameters, typically less than 1%, while retaining quality close to that of full fine-tuning.
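The weight-memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them. The helper name `weight_memory_gb` is hypothetical, and it deliberately counts weights only; gradients, optimizer states, and activations are not modelled.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "fp16", "4bit"):
    print(f"7B weights in {dtype}: ~{weight_memory_gb(7e9, dtype):.1f} GB")
```

Actual training memory is higher than these numbers, which is why the VRAM estimates quoted for each method should be read as rough guides.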
Comparison of PEFT Methods
| Method | Parameters Trained | VRAM (approx.) | Quality |
|---|---|---|---|
| Full Fine-tuning | 100% | ~28GB (7B) | ★★★★★ |
| LoRA | 0.1-1% | ~16GB (7B) | ★★★★ |
| QLoRA | 0.1-1% | ~6GB (7B) | ★★★★ |
| Prefix Tuning | <0.1% | ~14GB | ★★★ |
| Prompt Tuning | <0.01% | ~14GB | ★★ |
3. LoRA: Low-Rank Adaptation
LoRA (Low-Rank Adaptation) adds small adapter matrices alongside the model's existing layers. Instead of updating all the weights, LoRA trains only a pair of small matrices (a low-rank decomposition) injected into selected layers, typically the attention projections.
LoRA architecture

Original layer (frozen):
  Input -> W (d×d, frozen) -> Output

With LoRA, the input also flows through a trainable low-rank path, and the two results are summed:
  Input -> W (d×d, frozen) -> Output
  Input -> A (d×r, trainable) -> B (r×d, trainable) -> ΔOutput

Effective weight: W' = W + (α / r) × A × B
(PEFT scales the update by lora_alpha / r, not by α alone.)

  d = model dimension (4096)
  r = LoRA rank (8, 16, 32, 64)
  α = scaling factor (usually 2 × rank)

Example with d = 4096, r = 16:
- Full parameters: 4096 × 4096 = 16,777,216 (about 16.8M)
- LoRA parameters: 4096 × 16 + 16 × 4096 = 131,072 (about 131K, roughly 0.8% of the full matrix)
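The two-path structure can be sketched in a few lines of plain Python. This is a toy illustration, not PEFT's implementation: the sizes are tiny, the helper names are hypothetical, and it assumes the update is scaled by `alpha / r` and that the second matrix starts at zero, as common LoRA implementations do.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_forward(x, w, a, b, alpha, r):
    """Frozen path plus the scaled low-rank path: x·W + (alpha / r)·(x·A·B)."""
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[bv + scale * dv for bv, dv in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def lora_param_count(d, r):
    """Trainable parameters of one adapter pair: A is d×r, B is r×d."""
    return d * r + r * d


random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes, chosen only for illustration
w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen weight
a = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # trainable, small random init
b = [[0.0] * d for _ in range(r)]                                  # trainable, zero init
x = [[random.gauss(0, 1) for _ in range(d)]]

# With B at zero the adapter contributes nothing, so training starts from the base model.
unchanged_at_init = lora_forward(x, w, a, b, alpha, r) == matmul(x, w)
print(unchanged_at_init)
print(lora_param_count(4096, 16), "adapter params vs", 4096 * 4096, "full params")
```

Starting with a zero second matrix means the adapted model is identical to the base model before the first gradient step.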
# =============================================
# LoRA Fine-tuning Setup
# =============================================
# pip install transformers peft accelerate bitsandbytes datasets
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load the base model and tokenizer
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # Rank (8, 16, 32, 64)
    lora_alpha=32,       # Scaling factor (usually 2×rank)
    lora_dropout=0.05,   # Dropout for regularization
    target_modules=[     # Layers that receive LoRA adapters
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

# Wrap the model with the LoRA adapters
model = get_peft_model(model, lora_config)

# Print the number of trainable parameters
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 ||
# trainable%: 0.5196
Key LoRA Parameters
| Parameter | Typical Values | Description |
|---|---|---|
| r (rank) | 8, 16, 32, 64 | Rank of the adapter matrices: the higher the rank, the more trainable parameters |
| lora_alpha | 16, 32, 64 | Scaling factor, usually 2× the rank |
| lora_dropout | 0.05 - 0.1 | Dropout to reduce overfitting |
| target_modules | ["q_proj", "v_proj"] | Which layers receive LoRA adapters |
| bias | "none" | "none", "all", or "lora_only" |
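To see how `r` and `target_modules` drive the trainable-parameter count, the sketch below adds up `r × (in_features + out_features)` for each adapted projection. The layer shapes are assumptions about Llama-3.1-8B (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers) and are not stated in the text, so check them against the model's config before relying on the numbers. The helper name is hypothetical.

```python
# Assumed (in_features, out_features) per projection for Llama-3.1-8B.
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 4096, 1024, 14336, 32
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules, r, num_layers=NUM_LAYERS):
    """Sum r * (in + out) over the adapted projections, across all layers."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * num_layers


all_linear = list(MODULE_SHAPES)
attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]

print(f"all seven projections, r=16: {lora_trainable_params(all_linear, 16):,}")
print(f"attention only, r=16:        {lora_trainable_params(attention_only, 16):,}")
print(f"q_proj and v_proj, r=16:     {lora_trainable_params(['q_proj', 'v_proj'], 16):,}")
```

Under these assumed shapes, the seven-projection total agrees with the 41,943,040 reported by `print_trainable_parameters()` in the setup code, while adapting fewer modules trains correspondingly fewer parameters.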
4. QLoRA: Quantized LoRA
QLoRA builds on LoRA by adding quantization: the base model is quantized to 4-bit, while the LoRA adapters stay in 16-bit. As a result, a 7B model can be fine-tuned on a GPU with only about 6 GB of VRAM.
QLoRA architecture
- Base model (4-bit NF4): frozen and quantized. Inside each LlamaDecoderLayer, the Q, K, and V projections are stored with 4-bit NF4 quantization.
- LoRA adapters (A and B): trainable and kept in FP16 (16-bit).

VRAM comparison (7B model):
- Full FT (FP32): ~28 GB
- Full FT (FP16): ~14 GB
- LoRA (FP16): ~16 GB
- QLoRA (4-bit): ~6 GB (an RTX 3060 can handle it)
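The idea behind 4-bit storage can be illustrated with a simplified block-wise quantizer. This sketch is not NF4: it uses 16 evenly spaced levels scaled by each block's absolute maximum, whereas NF4 uses a codebook tuned for normally distributed weights. The function names and the tiny block size are illustrative assumptions.

```python
def quantize_block(values, levels=16):
    """Map one block of floats to integer codes in [0, levels - 1] using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    half = (levels - 1) / 2                      # 7.5 for 16 levels
    codes = [round(v / absmax * half + half) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the stored per-block scale."""
    half = (levels - 1) / 2
    return [(c - half) / half * absmax for c in codes]


def quantize(weights, block_size=4):
    """Quantize a flat list of weights block by block; each block keeps its own scale."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(block) for block in blocks]


weights = [0.12, -0.5, 0.33, 0.07, 2.0, -1.5, 0.25, 0.0]
packed = quantize(weights)
restored = [v for codes, absmax in packed for v in dequantize_block(codes, absmax)]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(packed)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a code that fits in 4 bits plus a shared per-block scale, which is why the frozen base model shrinks so much while the adapters stay in higher precision.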
# =============================================
# QLoRA: Quantized LoRA
# =============================================
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 1. Quantization config (4-bit NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

# 2. Load the model with quantization
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 3. Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# 4. Add the LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With only the four attention projections adapted, expect roughly
# 13.6M trainable parameters out of ~8B (well under 1%)

# Check VRAM usage
print(f"VRAM: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
# Roughly 5-6 GB for a model of this size
5. Dataset Preparation
Dataset quality is the single most important factor in fine-tuning. The data format depends on the task: instruction-following, chat, or completion.
# =============================================
# Dataset preparation for fine-tuning
# =============================================
from datasets import load_dataset, Dataset
import pandas as pd

# ----- Format 1: Instruction (Alpaca format) -----
dataset = load_dataset("json", data_files="data/instructions.json")
# Example Alpaca record:
# {
#   "instruction": "Explain what RAG is",
#   "input": "",
#   "output": "RAG (Retrieval Augmented Generation) is..."
# }

def format_alpaca(example):
    # The section markers are Indonesian ("Instruksi" = instruction,
    # "Jawaban" = answer). Use exactly the same template at inference time.
    parts = [f"### Instruksi:\n{example['instruction']}"]
    if example.get("input", ""):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Jawaban:\n{example['output']}{tokenizer.eos_token}")
    return {"text": "\n\n".join(parts)}

dataset = dataset.map(format_alpaca)

# ----- Format 2: Chat format -----
def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

# ----- Format 3: Custom dataset from CSV -----
# Alternative to Format 1. The tokenization step below reads a "text" column,
# so the CSV must provide one or be mapped through a formatter first.
df = pd.read_csv("data/training_data.csv")
dataset = Dataset.from_pandas(df)
dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Tokenize the dataset
def tokenize_function(examples):
    # No padding here: sequences are padded per batch (or packed) at training time
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
print(f"Train: {len(tokenized_dataset['train'])} samples")
print(f"Test: {len(tokenized_dataset['test'])} samples")
- Quality over quantity: 1,000 high-quality examples beat 10,000 noisy ones
- Consistent format: every record must follow the same template
- Diversity: vary how questions about the same topic are phrased
- Validation: always hold out 10-20% of the data for evaluation
- Indonesian: use a bilingual dataset with multilingual models
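The consistency and quality checks above can be partly automated before training. The following is a minimal sketch, assuming Alpaca-style records with `instruction`, optional `input`, and `output` fields. The helper `validate_records` and the sample records are hypothetical.

```python
def validate_records(records):
    """Keep well-formed, non-duplicate Alpaca-style records and report the rest."""
    required = ("instruction", "output")
    seen, clean, rejected = set(), [], []
    for index, record in enumerate(records):
        # Reject records whose required fields are missing or blank.
        if any(not str(record.get(key, "")).strip() for key in required):
            rejected.append((index, "missing or empty field"))
            continue
        # Treat the same instruction + input (ignoring case and whitespace) as a duplicate.
        key = (
            str(record["instruction"]).strip().lower(),
            str(record.get("input", "")).strip().lower(),
        )
        if key in seen:
            rejected.append((index, "duplicate instruction"))
            continue
        seen.add(key)
        clean.append(record)
    return clean, rejected


records = [
    {"instruction": "Explain what RAG is", "input": "",
     "output": "RAG retrieves documents and passes them to the model as context."},
    {"instruction": "explain what RAG is ", "input": "", "output": "A repeated question."},
    {"instruction": "Summarise this text", "input": "LoRA trains small adapters.", "output": ""},
    {"instruction": "Define LoRA", "output": "A low-rank adapter method."},
]
clean, rejected = validate_records(records)
print(f"kept {len(clean)} records, rejected {rejected}")
```

Checks like these catch structural problems only; judging whether the answers themselves are good still requires human review.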
6. Training Configuration & Execution
# =============================================
# Training with the TRL SFTTrainer
# =============================================
# pip install trl
from trl import SFTConfig, SFTTrainer

# SFTConfig extends TrainingArguments with SFT-specific options such as
# max_seq_length and packing (plain TrainingArguments rejects them).
# Argument names differ between TRL releases: check your installed version.
training_args = SFTConfig(
    output_dir="./results",
    # Hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 4×4 = 16
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    # Logging
    logging_dir="./logs",
    logging_steps=10,
    # Save & eval
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    # Mixed precision
    fp16=True,  # or bf16=True if the GPU supports it
    # Gradient checkpointing (saves VRAM)
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # Optimizer
    optim="paged_adamw_8bit",  # for QLoRA
    # SFT-specific options
    max_seq_length=2048,  # newer TRL releases name this `max_length`
    packing=True,         # pack multiple short sequences together
    # Reporting
    report_to="none",  # or "wandb" for experiment tracking
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,  # newer TRL releases name this `processing_class`
)

# Start training
print("Starting fine-tuning...")
trainer.train()

# Save the adapter and tokenizer
trainer.save_model("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
print("Training finished. Model saved to ./fine-tuned-model")
Hyperparameter Guide
| Parameter | Recommendation | Tips |
|---|---|---|
| learning_rate | 1e-4 ~ 3e-4 | Higher than for full fine-tuning |
| epochs | 1-5 | Small dataset: use more epochs |
| batch_size | 4-16 (effective) | Use gradient accumulation |
| warmup_ratio | 0.03-0.1 | Stabilizes the start of training |
| LoRA rank | 16-64 | Complex task: use a higher rank |
| scheduler | cosine | Cosine or linear |
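How batch size, gradient accumulation, warmup ratio, and the cosine scheduler interact is easier to see with numbers. The sketch below is an approximation under stated assumptions: a hypothetical dataset of 1,000 samples on a single GPU, linear warmup followed by cosine decay to zero. The exact step counts a real trainer reports may differ slightly, and both function names are hypothetical.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Return (effective batch size, total optimizer steps, warmup steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))


effective_batch, total_steps, warmup_steps = training_schedule(
    num_samples=1000, per_device_batch=4, grad_accum=4, epochs=3, warmup_ratio=0.03
)
print(f"effective batch {effective_batch}, {total_steps} steps, {warmup_steps} warmup steps")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {cosine_lr(step, total_steps, warmup_steps, 2e-4):.2e}")
```

With a small dataset the warmup phase lasts only a handful of steps, which is one reason to raise `warmup_ratio` toward the upper end of the recommended range when training is unstable early on.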
7. Model Evaluation
# =============================================
# Evaluate the fine-tuned model
# =============================================
import math

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Load the base model, then attach the trained adapter
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
model.eval()

# Inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test prompts
test_prompts = [
    "Explain what LoRA is in 3 sentences:",
    "What is the difference between LoRA and QLoRA?",
    "How do you calculate the VRAM required?",
]
for prompt in test_prompts:
    # Same template (with its Indonesian section markers) as used for training
    result = pipe(
        f"### Instruksi:\n{prompt}\n\n### Jawaban:\n",
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    answer = result[0]["generated_text"].split("### Jawaban:")[1][:300]
    print(f"Q: {prompt}")
    print(f"A: {answer.strip()}")
    print("---")

# Perplexity evaluation
eval_text = "LoRA adalah teknik parameter-efficient fine-tuning..."
inputs = tokenizer(eval_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
perplexity = math.exp(outputs.loss.item())
print(f"Perplexity: {perplexity:.2f}")
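Perplexity is the exponential of the mean negative log-likelihood per token, which is why taking `math.exp` of the model's cross-entropy loss yields it. The sketch below shows the same calculation on hand-written token log-probabilities. The function name and the example values are illustrative, not model output.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """exp of the mean negative log-likelihood over the tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
uniform_guess = [math.log(1 / 4)] * 10
confident = [math.log(0.9)] * 10

print(f"uniform over 4 options: {perplexity_from_logprobs(uniform_guess):.2f}")
print(f"confident predictions:  {perplexity_from_logprobs(confident):.2f}")
```

Lower is better, and a perplexity measured on a single sentence is noisy, so it is more informative to average over a held-out evaluation set.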
8. Merge & Export Model
# =============================================
# Merge the LoRA adapter into the base model
# =============================================
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model (unquantized, so the adapter can be merged)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)
# The tokenizer was saved next to the adapter after training
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")

# Merge the adapter weights into the base weights
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Export to GGUF (for llama.cpp / Ollama) with the tools from the llama.cpp
# repository: convert to an f16 GGUF first, then quantize it to Q4_K_M.
# python convert_hf_to_gguf.py ./merged-model --outfile merged-model-f16.gguf --outtype f16
# llama-quantize merged-model-f16.gguf merged-model-q4_k_m.gguf Q4_K_M

# Push to the Hugging Face Hub
from huggingface_hub import login
login(token="hf_...")
merged_model.push_to_hub("username/llama-3.1-8b-finetuned")
tokenizer.push_to_hub("username/llama-3.1-8b-finetuned")
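Merging is safe because the adapter path is linear: folding the scaled product of the two adapter matrices into the base weight gives the same outputs as keeping them separate. The toy check below illustrates this with tiny random matrices. It is a sketch of the idea behind `merge_and_unload()`, not its implementation, and it assumes the `alpha / r` scaling. All names and sizes are illustrative.

```python
import math
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * (A @ B)."""
    scale = alpha / r
    delta = matmul(a, b)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
d, r, alpha = 6, 2, 4  # toy sizes for illustration only
w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weight
a = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # adapter after training
b = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
x = [[random.gauss(0, 1) for _ in range(d)]]

# Adapter kept separate: frozen path plus scaled low-rank path.
separate = [bv + (alpha / r) * dv
            for bv, dv in zip(matmul(x, w)[0], matmul(matmul(x, a), b)[0])]
# Adapter merged: a single matmul against the merged weight.
merged = matmul(x, merge_lora(w, a, b, alpha, r))[0]

outputs_match = all(math.isclose(s, m, rel_tol=1e-9, abs_tol=1e-9)
                    for s, m in zip(separate, merged))
print(outputs_match)
```

After merging, inference needs no PEFT dependency and pays no extra matmul, at the cost of no longer being able to swap adapters on a shared base model.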
9. Deploying to Production
# =============================================
# Deploy the fine-tuned model with vLLM
# =============================================
# Option 1: vLLM server (high performance)
# pip install vllm
# CLI:
# vllm serve ./merged-model --port 8000 --gpu-memory-utilization 0.9

# Option 2: Ollama (simple, local)
# ollama create mymodel -f Modelfile
# Modelfile content:
# FROM ./merged-model-q4_k_m.gguf
# PARAMETER temperature 0.7
# SYSTEM "You are an AI assistant..."

# Option 3: FastAPI + Transformers
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
pipe = pipeline("text-generation", model="./merged-model")

class GenerateRequest(BaseModel):
    prompt: str  # sent in the JSON request body

@app.post("/generate")
def generate(request: GenerateRequest):
    # A plain `def` lets FastAPI run the blocking pipeline call in a worker thread
    result = pipe(request.prompt, max_new_tokens=500, do_sample=True, temperature=0.7)
    return {"text": result[0]["generated_text"]}
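Once the vLLM server from Option 1 is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below only builds the request so it can be inspected without a live server. It assumes the default `/v1/completions` route on port 8000 and that the served model name equals the path passed to `vllm serve`. The helper name is hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="./merged-model"):
    """Build (but do not send) a POST request for an OpenAI-style completions endpoint."""
    payload = {
        "model": model,  # assumed to match the path given to `vllm serve`
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The prompt reuses the training template, including its Indonesian section markers.
request = build_completion_request("### Instruksi:\nExplain what LoRA is\n\n### Jawaban:\n")
print(request.get_method(), request.full_url)

# To send it while the server is running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["text"])
```

Sending prompts in the same template used during training matters more than the transport: a mismatched template is a common cause of degraded answers after deployment.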
10. Comprehension Quiz
1. What is the main advantage of LoRA over full fine-tuning?
2. What does QLoRA do differently from standard LoRA?
3. What is the role of the 'r' (rank) parameter in LoRA?
4. Why is dataset quality so important in fine-tuning?
5. What step comes after fine-tuning finishes and before deployment?
Summary
- LoRA: small adapter matrices trained alongside a frozen model, using few parameters
- QLoRA: LoRA plus 4-bit quantization, needing only ~6GB of VRAM for a 7B model
- PEFT: the family of parameter-efficient techniques, including LoRA, prefix tuning, and prompt tuning
- Dataset: quality over quantity, a consistent format, and 10-20% held out for validation
- Hyperparameters: LR 2e-4, rank 16-64, cosine scheduler
- Merge & Deploy: merge the adapter, export, then deploy (vLLM/Ollama)