Fine-tuning LLMs with LoRA/QLoRA: Adapter Layers, Quantization & Deployment

Learn how to fine-tune Large Language Models efficiently with LoRA and QLoRA, covering adapter layers, quantization, training configuration, evaluation, and deployment to production.

1. Why Fine-tune an LLM?

Large Language Models (LLMs) such as Llama, Mistral, and GPT are pretrained on trillions of tokens. A general-purpose model is not necessarily optimal for your specific task, however. Fine-tuning lets you adapt the model to a particular domain, task, or output style.

When Is Fine-tuning Needed?

Scenario | Solution | Complexity
The model must know a specific domain | Fine-tuning | Medium
You want a particular output style (formal, concise) | Fine-tuning | Medium
You have proprietary data (company documents) | RAG or fine-tuning | Medium
You need high accuracy on a specific task | Fine-tuning | Medium-High
You only need additional context | RAG (no fine-tuning needed) | Low
Your GPU budget is limited | QLoRA / LoRA | Medium
Diagram: Fine-tuning vs Prompt Engineering vs RAG
┌────────────────────────────────────────────────────────────────────┐
│                     LLM ADAPTATION APPROACHES                      │
│                                                                    │
│  1. Prompt Engineering                                             │
│     ┌──────────┐     ┌──────────┐                                  │
│     │  System  │────▶│   LLM    │────▶ Output                      │
│     │  Prompt  │     │ (frozen) │                                  │
│     └──────────┘     └──────────┘                                  │
│     ✅ No training required                                        │
│     ❌ Limited by the context window                               │
│                                                                    │
│  2. RAG                                                            │
│     ┌──────────┐     ┌──────────┐     ┌──────────┐                 │
│     │  Vector  │────▶│ Retrieval│────▶│   LLM    │──▶ Output       │
│     │  Store   │     │  Chain   │     │ (frozen) │                 │
│     └──────────┘     └──────────┘     └──────────┘                 │
│     ✅ Real-time data, no training required                        │
│     ❌ Does not change the model's style or capabilities           │
│                                                                    │
│  3. Fine-tuning (LoRA/QLoRA)                                       │
│     ┌──────────┐     ┌──────────────────────┐                      │
│     │ Domain   │────▶│  LLM + LoRA Adapter  │──▶ Fine-tuned LLM    │
│     │ Dataset  │     │ (train adapter only) │                      │
│     └──────────┘     └──────────────────────┘                      │
│     ✅ Changes the model's style and capabilities                  │
│     ✅ Cheap with LoRA/QLoRA                                       │
│     ❌ Requires a high-quality dataset                             │
└────────────────────────────────────────────────────────────────────┘

2. Full Fine-tuning vs Parameter-Efficient

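Full fine-tuning updates every parameter of the model. For a 7B-parameter model, the weights alone occupy about 28GB in FP32 or about 14GB in FP16, and training also needs memory for gradients, optimizer states, and activations, so the real requirement is several times higher. That is very expensive!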
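The 28GB and 14GB figures come from simple arithmetic: number of parameters times bytes per parameter. The sketch below reproduces them as a rough estimate, not a measurement. The helper names and the 16-bytes-per-parameter rule of thumb for mixed-precision Adam training are assumptions added here for illustration, and activation memory is ignored.

```python
# Rough memory arithmetic for a 7B-parameter model (decimal GB).
# Assumption: helper names and the Adam rule of thumb are illustrative only.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def full_finetune_memory_gb(num_params: float) -> float:
    """Rule-of-thumb estimate for mixed-precision training with Adam.

    fp16 weights (2) + fp16 gradients (2) + fp32 master weights (4)
    + two fp32 Adam moments (8) = 16 bytes per parameter.
    Activations are not included.
    """
    return num_params * 16 / 1e9


if __name__ == "__main__":
    n = 7e9
    for dtype in ("fp32", "fp16", "4bit"):
        print(f"{dtype}: {weight_memory_gb(n, dtype):.1f} GB for weights")
    print(f"full fine-tuning (rule of thumb): ~{full_finetune_memory_gb(n):.0f} GB")
```

Under these assumptions the weights-only numbers match the ones quoted above, while the training estimate shows why full fine-tuning of a 7B model does not fit on a single consumer GPU.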

Parameter-Efficient Fine-Tuning (PEFT) trains only a small fraction of the parameters, typically less than 1%, while retaining quality close to that of full fine-tuning.

Comparison of PEFT Methods

Method | Trained Parameters | Approx. VRAM | Quality
Full fine-tuning | 100% | ~28GB for FP32 weights alone (7B) | ⭐⭐⭐⭐⭐
LoRA | 0.1-1% | ~16GB (7B) | ⭐⭐⭐⭐
QLoRA | 0.1-1% | ~6GB (7B) | ⭐⭐⭐⭐
Prefix Tuning | <0.1% | ~14GB | ⭐⭐⭐
Prompt Tuning | <0.01% | ~14GB | ⭐⭐

3. LoRA: Low-Rank Adaptation

LoRA (Low-Rank Adaptation) adds small adapter layers alongside the model's existing layers. Instead of updating all the weights, LoRA freezes them and trains only a pair of small matrices (a low-rank decomposition) injected into selected linear layers, most commonly the attention projections.
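To make the low-rank idea concrete, here is a minimal pure-Python sketch with a tiny 4×4 layer. It assumes a frozen weight W (d×d), trainable matrices A (d×r) and B (r×d), and the usual scaling of alpha / r. The helper names and the toy values are hypothetical. The sketch shows that running the adapter as a side path gives the same output as folding it into the weight, and it counts the adapter parameters for d=4096, r=16.

```python
# Minimal LoRA sketch with plain Python lists (no external libraries).
# Assumption: row-vector convention, output = x @ W + (alpha / r) * (x @ A) @ B.


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y, scale=1.0):
    """Element-wise x + scale * y."""
    return [[a + scale * b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one adapter: A is d_in x r, B is r x d_out."""
    return d_in * r + r * d_out


d, r, alpha = 4, 2, 4
scale = alpha / r  # 2.0

W = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # frozen
A = [[1, 0], [0, 1], [1, 0], [0, 1]]                          # trainable, d x r
B = [[0.5, 0, 0, 0], [0, 0.5, 0, 0]]                          # trainable, r x d
x = [[1, 2, 3, 4]]

# Adapter as a side path: the frozen layer plus the scaled low-rank update.
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B), scale)

# Adapter merged into the weight: W_merged = W + (alpha / r) * A @ B.
W_merged = add(W, matmul(A, B), scale)
y_merged = matmul(x, W_merged)

print(y_adapter, y_merged)
print(lora_param_count(4096, 4096, 16), "adapter params vs", 4096 * 4096, "full params")
```

Because both paths are mathematically identical, the adapter can be kept separate during training and merged into the base weights afterwards without changing the model's outputs.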

Diagram: LoRA Architecture
┌──────────────────────────────────────────────────────────────────┐
│                        LoRA ARCHITECTURE                         │
│                                                                  │
│  Original layer (frozen):                                        │
│  Input ──▶ W (d×d) ──▶ Output                                    │
│            (frozen)                                              │
│                                                                  │
│  With LoRA:                                                      │
│  Input ──┬── W (d×d) ────────┬──▶ Output                         │
│          │   (frozen)        │                                   │
│          │                   │                                   │
│          └── A (d×r) ── B (r×d) ──▶ ΔOutput                      │
│              (A and B are trainable)                             │
│                                                                  │
│  W_bar = W + (α/r) × A × B                                       │
│                                                                  │
│  d = model dimension (4096)                                      │
│  r = LoRA rank (8, 16, 32, 64)                                   │
│  α = lora_alpha (usually 2 × rank)                               │
│                                                                  │
│  Example:                                                        │
│  d=4096, r=16:                                                   │
│  → Full params: 4096 × 4096 = 16.7M params                       │
│  → LoRA params: 4096×16 + 16×4096 = 131K params (0.8%)           │
└──────────────────────────────────────────────────────────────────┘
Python: LoRA Setup with PEFT
# =============================================
# LoRA Fine-tuning Setup
# =============================================
# pip install transformers peft accelerate bitsandbytes datasets

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load the base model
model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # Rank (8, 16, 32, 64)
    lora_alpha=32,           # Scaling factor (usually 2 × rank)
    lora_dropout=0.05,       # Dropout for regularization
    target_modules=[         # Layers that receive LoRA adapters
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print the number of trainable parameters
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || 
# trainable%: 0.5196
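The trainable-parameter count printed above can be reproduced by hand: each adapted linear layer contributes r × (in_features + out_features) parameters. The sketch below assumes the layer shapes of Llama-3.1-8B as given in its published configuration (hidden size 4096, 8 key/value heads of dimension 128, MLP size 14336, 32 layers). The function name is hypothetical.

```python
# Counting LoRA parameters per target module.
# Assumption: the shapes below are taken from the Llama-3.1-8B model config.
HIDDEN = 4096         # hidden_size
KV_DIM = 1024         # 8 key/value heads x 128 head_dim (grouped-query attention)
INTERMEDIATE = 14336  # intermediate_size of the MLP
NUM_LAYERS = 32       # num_hidden_layers

# (in_features, out_features) of each linear layer that LoRA can wrap
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules, r):
    """Each adapter adds r * (in_features + out_features) parameters per layer."""
    per_layer = sum(r * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS


all_linear = lora_trainable_params(list(MODULE_SHAPES), r=16)
attention_only = lora_trainable_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)

print(f"all seven modules: {all_linear:,}")       # 41,943,040
print(f"attention only:    {attention_only:,}")   # 13,631,488
```

Targeting only the attention projections cuts the adapter to about a third of the size, which is a useful lever when memory is tight.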

Key LoRA Parameters

Parameter | Typical Values | Description
r (rank) | 8, 16, 32, 64 | Rank of the adapter matrices; a higher rank means more trainable parameters
lora_alpha | 16, 32, 64 | Scaling factor, usually 2 × rank; the update is scaled by lora_alpha / r
lora_dropout | 0.05 - 0.1 | Dropout to prevent overfitting
target_modules | ["q_proj", "v_proj"] | Which layers receive LoRA adapters
bias | "none" | "none", "all", or "lora_only"

4. QLoRA: Quantized LoRA

QLoRA extends LoRA with quantization: the base model is quantized to 4-bit, while the LoRA adapters stay in 16-bit. As a result, you can fine-tune a 7B model on a GPU with roughly 6GB of VRAM, although the exact figure depends on sequence length and batch size.
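The memory saving comes from storing each base weight as a 4-bit code plus a shared scale per block of weights. The sketch below illustrates that idea with plain symmetric absmax quantization. This is a deliberately simplified stand-in, not the NF4 scheme used by QLoRA, which relies on non-uniform levels. All function names and the sample weights are hypothetical.

```python
# Simplified block-wise 4-bit quantization (uniform absmax, NOT NF4).
# Each block stores integer codes in [-7, 7] plus one floating-point scale.


def quantize_block(values, levels=7):
    """Map floats to integer codes using the block's largest magnitude."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / levels
    codes = [round(v / scale) for v in values]
    return codes, scale


def quantize(values, block_size=4):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks):
    """Reconstruct approximate floats from codes and scales."""
    restored = []
    for codes, scale in blocks:
        restored.extend(code * scale for code in codes)
    return restored


weights = [0.12, -0.70, 0.33, 0.04, 1.40, -0.02, 0.65, -1.15]
blocks = quantize(weights)
restored = dequantize(blocks)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(blocks)
print(f"largest reconstruction error: {max_error:.3f}")
```

The reconstruction error of each weight is bounded by half of its block's scale, which is why smaller blocks (and, in QLoRA, levels matched to the weight distribution) preserve more accuracy.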

Diagram: QLoRA (Quantized Base + LoRA Adapters)
┌──────────────────────────────────────────────────────────────────┐
│                        QLoRA ARCHITECTURE                        │
│                                                                  │
│  ┌─────────────────────────────────┐                             │
│  │   Base Model (4-bit NF4)        │ ← Frozen, quantized         │
│  │   ┌─────────────────────────┐   │                             │
│  │   │  LlamaDecoderLayer      │   │                             │
│  │   │  ┌────┐  ┌────┐  ┌────┐ │   │                             │
│  │   │  │Q   │  │K   │  │V   │ │   │  4-bit NF4 quantization     │
│  │   │  │proj│  │proj│  │proj│ │   │                             │
│  │   │  └────┘  └────┘  └────┘ │   │                             │
│  │   │  ┌────┐  ┌────┐         │   │                             │
│  │   │  │LoRA│  │LoRA│         │   │                             │
│  │   │  │ A  │  │ B  │ (16-bit)│   │  ← LoRA adapters (FP16)     │
│  │   │  └────┘  └────┘         │   │                             │
│  │   └─────────────────────────┘   │                             │
│  └─────────────────────────────────┘                             │
│                                                                  │
│  Approximate VRAM (7B model):                                    │
│  Full FT, FP32 weights alone:  ~28 GB  ❌                        │
│  Full FT, FP16 weights alone:  ~14 GB  ⚠️                        │
│  LoRA (FP16 base):             ~16 GB  ⚠️                        │
│  QLoRA (4-bit base):            ~6 GB  ✅ (RTX 3060 can run it)  │
└──────────────────────────────────────────────────────────────────┘
Python: QLoRA Setup with BitsAndBytes
# =============================================
# QLoRA: Quantized LoRA
# =============================================
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 1. Quantization configuration (4-bit NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.float16, # dtype used for computation
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)

# 2. Load the model with quantization
model_name = "meta-llama/Llama-3.1-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

# 3. Prepare the model for k-bit training
model = prepare_model_for_kbit_training(model)

# 4. Add the LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on the four attention projections only:
# trainable: ~13.6M of ~8B parameters (about 0.17%)

# Check VRAM usage
print(f"VRAM: {torch.cuda.memory_allocated()/1e9:.1f} GB")
# Roughly 5-6 GB for a 7-8B model

5. Dataset Preparation

Dataset quality is the most important factor in fine-tuning. The data format depends on the task: instruction-following, chat, or completion.

Python: Dataset Preparation
# =============================================
# Dataset Preparation for Fine-tuning
# =============================================
# Assumes `tokenizer` is already loaded (see the LoRA/QLoRA setup).
# Formats 1-3 are alternatives: use the one that matches your data.
from datasets import load_dataset, Dataset

# ----- Format 1: Instruction (Alpaca format) -----
dataset = load_dataset("json", data_files="data/instructions.json", split="train")

# Example Alpaca record:
# {
#   "instruction": "Jelaskan apa itu RAG",
#   "input": "",
#   "output": "RAG (Retrieval Augmented Generation) adalah..."
# }

def format_alpaca(example):
    # The template keeps Indonesian section labels (Instruksi = instruction,
    # Jawaban = answer); prompts at inference time must use the same labels.
    if example.get("input", ""):
        text = f"""### Instruksi:
{example['instruction']}

### Input:
{example['input']}

### Jawaban:
{example['output']}{tokenizer.eos_token}"""
    else:
        text = f"""### Instruksi:
{example['instruction']}

### Jawaban:
{example['output']}{tokenizer.eos_token}"""
    return {"text": text}

dataset = dataset.map(format_alpaca)

# ----- Format 2: Chat format -----
def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]}
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {"text": text}

# ----- Format 3: Custom dataset from CSV -----
# Alternative source: this replaces the dataset built above, and the CSV
# must already contain a "text" column.
import pandas as pd
df = pd.read_csv("data/training_data.csv")
dataset = Dataset.from_pandas(df)

# Hold out 10% of the data for evaluation
dataset = dataset.train_test_split(test_size=0.1, seed=42)

# Tokenize the dataset (padding is left to the data collator at batch time)
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

print(f"Train: {len(tokenized_dataset['train'])} samples")
print(f"Test: {len(tokenized_dataset['test'])} samples")
💡 Dataset Tips for Fine-tuning
  • Quality > quantity: 1,000 high-quality examples beat 10,000 noisy ones
  • Consistent format: every record must follow the same format
  • Diversity: vary how questions are phrased for the same topic
  • Validation: always hold out 10-20% of the data for evaluation
  • Indonesian: use bilingual datasets for multilingual models
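Several of these tips can be enforced with a small cleaning pass before tokenization. The sketch below is one possible approach using only the standard library. It assumes Alpaca-style records with `instruction`, `input`, and `output` fields, and the function names and sample records are hypothetical.

```python
# Minimal dataset hygiene: drop incomplete records and duplicates, then split.
import random

REQUIRED_KEYS = ("instruction", "output")  # fields every record must fill in


def clean_examples(examples):
    """Keep records with non-empty required fields and unseen prompts."""
    seen = set()
    cleaned = []
    for example in examples:
        if any(not str(example.get(key, "")).strip() for key in REQUIRED_KEYS):
            continue  # missing or empty field
        prompt_key = (
            str(example["instruction"]).strip().lower(),
            str(example.get("input", "")).strip().lower(),
        )
        if prompt_key in seen:
            continue  # same prompt already present
        seen.add(prompt_key)
        cleaned.append(example)
    return cleaned


def train_eval_split(examples, eval_fraction=0.1, seed=42):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


raw = [
    {"instruction": "Explain what RAG is", "input": "", "output": "RAG combines retrieval with generation."},
    {"instruction": "explain what RAG is ", "input": "", "output": "Duplicate prompt."},
    {"instruction": "Define LoRA", "input": "", "output": ""},
    {"instruction": "Define QLoRA", "input": "", "output": "LoRA on a 4-bit quantized base model."},
]

cleaned = clean_examples(raw)
train, evaluation = train_eval_split(cleaned, eval_fraction=0.5)
print(len(raw), "raw ->", len(cleaned), "clean ->", len(train), "train /", len(evaluation), "eval")
```

Exact-match deduplication is the simplest option; near-duplicate detection would need a fuzzier comparison and is left out of this sketch.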

6. Training Configuration & Execution

Python: Training with SFTTrainer
# =============================================
# Training with TRL SFTTrainer
# =============================================
# pip install trl

from trl import SFTConfig, SFTTrainer

# Training arguments. SFTConfig extends TrainingArguments with SFT-specific
# options such as packing and the maximum sequence length.
training_args = SFTConfig(
    output_dir="./results",
    
    # Hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 4 × 4 = 16
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    
    # Logging
    logging_dir="./logs",
    logging_steps=10,
    
    # Save & Eval
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    save_total_limit=3,
    load_best_model_at_end=True,
    
    # Mixed precision
    fp16=True,  # or bf16=True if the GPU supports it
    
    # Gradient checkpointing (saves VRAM)
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    
    # Optimizer
    optim="paged_adamw_8bit",  # For QLoRA
    
    # Max sequence length (named max_length in newer TRL releases)
    max_seq_length=2048,
    
    # Pack multiple short sequences into one training sequence
    packing=True,
    
    # Report
    report_to="none",  # or "wandb" for experiment tracking
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    processing_class=tokenizer,  # use tokenizer=tokenizer on older TRL releases
)

# Start training!
print("Starting fine-tuning...")
trainer.train()

# Save the model (for a PEFT model this stores the adapter weights)
trainer.save_model("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
print("Training finished! Model saved to ./fine-tuned-model")

Hyperparameter Guide

Parameter | Recommendation | Tips
learning_rate | 1e-4 to 3e-4 | Higher than for full fine-tuning
epochs | 1-5 | Small dataset → more epochs
batch_size | 4-16 (effective) | Use gradient accumulation
warmup_ratio | 0.03-0.1 | Stabilizes early training
LoRA rank | 16-64 | Complex task → higher rank
scheduler | cosine | Cosine or linear
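It helps to see how these settings interact before launching a run. The sketch below estimates the effective batch size, the number of optimizer steps, and the learning rate at a given step for linear warmup followed by cosine decay. It is an approximation of the general schedule shape, not the trainer's exact internals, and the function names and the 1,000-sample dataset are assumptions.

```python
# Back-of-the-envelope training schedule for the hyperparameters above.
import math


def training_steps(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def cosine_lr(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1,000 samples, batch 4, accumulation 4, 3 epochs.
effective_batch, total_steps = training_steps(1000, 4, 4, 3)
warmup_steps = math.ceil(0.03 * total_steps)

print(f"effective batch: {effective_batch}, total steps: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>3}: lr = {cosine_lr(step, total_steps, 2e-4, warmup_steps):.2e}")
```

With a small dataset the warmup covers only a handful of steps, which is one reason a slightly larger warmup_ratio can help stabilize short runs.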

7. Model Evaluation

Python: Model Evaluation
# =============================================
# Evaluating the Fine-tuned Model
# =============================================
import math

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")
model.eval()

# Inference
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Test prompts
test_prompts = [
    "Explain what LoRA is in 3 sentences:",
    "What is the difference between LoRA and QLoRA?",
    "How do you calculate the VRAM required?",
]

for prompt in test_prompts:
    # The prompt must use the same template labels as the training data
    result = pipe(
        f"### Instruksi:\n{prompt}\n\n### Jawaban:\n",
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        top_p=0.9
    )
    answer = result[0]["generated_text"].split("### Jawaban:")[1][:300]
    print(f"Q: {prompt}")
    print(f"A: {answer}")
    print("---")

# Perplexity evaluation (a single sentence is only a smoke test;
# use the held-out split for a meaningful number)
eval_text = "LoRA is a parameter-efficient fine-tuning technique..."
inputs = tokenizer(eval_text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
    perplexity = math.exp(outputs.loss.item())
    print(f"Perplexity: {perplexity:.2f}")
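The perplexity computed at the end of the script is the exponential of the average per-token negative log-likelihood. The standalone sketch below shows that relationship on made-up token probabilities. The function name and the probability values are illustrative assumptions, not outputs of a real model.

```python
# Perplexity from per-token log-probabilities (standard library only).
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform_guess = [math.log(0.25)] * 8
confident = [math.log(0.9)] * 8

print(f"uniform over 4 options: {perplexity(uniform_guess):.2f}")
print(f"confident model:        {perplexity(confident):.2f}")
```

Lower is better: a fine-tuned model should reach a lower perplexity than the base model on held-out text from the target domain.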

8. Merge & Export Model

Python: Merging LoRA into the Base Model
# =============================================
# Merge the LoRA adapter into the base model
# =============================================
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model (unquantized, so the adapter can be merged)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, "./fine-tuned-model")

# Merge the weights
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Export to GGUF (for llama.cpp / Ollama) with the tools from the llama.cpp repository:
# python convert_hf_to_gguf.py ./merged-model --outtype f16 --outfile merged-model-f16.gguf
# ./llama-quantize merged-model-f16.gguf merged-model-q4_k_m.gguf Q4_K_M

# Push to the Hugging Face Hub
from huggingface_hub import login
login(token="hf_...")

merged_model.push_to_hub("username/llama-3.1-8b-finetuned")
tokenizer.push_to_hub("username/llama-3.1-8b-finetuned")

9. Deployment to Production

Python: Deploying with vLLM
# =============================================
# Deploying the Fine-tuned Model with vLLM
# =============================================

# Option 1: vLLM Server (High Performance)
# pip install vllm

# CLI:
# vllm serve ./merged-model --port 8000 --gpu-memory-utilization 0.9

# Option 2: Ollama (Simple, Local)
# ollama create mymodel -f Modelfile
# Modelfile content:
# FROM ./merged-model-q4_k_m.gguf
# PARAMETER temperature 0.7
# SYSTEM "Anda adalah asisten AI..."

# Option 3: FastAPI + Transformers
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
pipe = pipeline("text-generation", model="./merged-model")

class GenerateRequest(BaseModel):
    prompt: str  # sent in the JSON request body

@app.post("/generate")
def generate(request: GenerateRequest):
    # A plain `def` runs in FastAPI's thread pool, so the blocking
    # generation call does not stall the event loop.
    result = pipe(request.prompt, max_new_tokens=500, do_sample=True, temperature=0.7)
    return {"text": result[0]["generated_text"]}
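Once the vLLM server from Option 1 is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below builds such a request with the standard library only. The base URL, the helper names, and the assumption that the served model name equals the path passed to `vllm serve` are illustrative and may need adjusting for your setup.

```python
# Minimal client sketch for a vLLM server (OpenAI-compatible completions API).
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="./merged-model", max_tokens=200,
                             temperature=0.7):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,  # assumed to match the path given to `vllm serve`
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


request = build_completion_request("Explain LoRA in 3 sentences.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `complete("...")` sends the request. It is not invoked here so that the sketch runs without a server.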

10. Comprehension Quiz

1. What is the main advantage of LoRA over full fine-tuning?

2. What does QLoRA do differently from standard LoRA?

3. What is the role of the 'r' (rank) parameter in LoRA?

4. Why is dataset quality so important in fine-tuning?

5. What step comes after fine-tuning finishes and before deployment?

Summary

📝 Key Points
  • LoRA: small adapter layers trained alongside a frozen model, parameter-efficient
  • QLoRA: LoRA + 4-bit quantization, only ~6GB of VRAM for a 7B model
  • PEFT: a family of parameter-efficient techniques, including LoRA, prefix tuning, and prompt tuning
  • Dataset: quality > quantity, consistent format, 10-20% held out for validation
  • Hyperparameters: LR 2e-4, rank 16-64, cosine scheduler
  • Merge & Deploy: merge adapter → export → deploy (vLLM/Ollama)