NLP with Transformers: BERT, GPT, Hugging Face

📋 Table of Contents

Introduction to NLP & Transformers
Tokenization
Attention Mechanism
BERT Architecture
GPT Architecture
Hugging Face Ecosystem
Fine-tuning for NLP Tasks
Popular NLP Tasks
Deployment
Comprehension Quiz

1. Introduction to NLP & Transformers

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, interpret, and generate human language. Before Transformers, NLP relied on RNNs and LSTMs, which process tokens one at a time, train slowly, and struggle to capture long-range context.

Transformers (introduced by Google researchers in the 2017 paper "Attention Is All You Need") revolutionized NLP with the self-attention mechanism, which can process an entire sequence in parallel.

Diagram: Evolution of NLP Architectures

┌─────────────────────────────────────────────────────────────────┐
│                 EVOLUTION OF NLP ARCHITECTURES                  │
│                                                                  │
│  Era 1: Rule-based (1950-1990)                                  │
│  → Regex, grammar rules, dictionary                             │
│                                                                  │
│  Era 2: Statistical (1990-2013)                                 │
│  → Bag of Words, TF-IDF, Naive Bayes, SVM                      │
│                                                                  │
│  Era 3: Deep Learning (2013-2017)                               │
│  → Word2Vec, GloVe, RNN, LSTM, GRU                             │
│  ❌ Vanishing gradient, sequential processing                   │
│                                                                  │
│  Era 4: Transformers (2017-present)                             │
│  → Attention mechanism, parallel processing                     │
│  → BERT (2018), GPT-2 (2019), GPT-3 (2020)                    │
│  → GPT-4 (2023), Llama 3 (2024)                                │
│  ✅ State-of-the-art on almost all NLP tasks!                   │
└─────────────────────────────────────────────────────────────────┘

Why Do Transformers Dominate?

Feature	RNN/LSTM	Transformers
Parallel processing	❌ Sequential	✅ Fully parallel
Long-range dependencies	⚠️ Difficult	✅ Easy (attention)
Training speed	Slow	Fast (with GPUs)
Pre-training	Uncommon	Transfer learning
Scalability	Limited	Highly scalable

2. Tokenization

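Tokenization is the process of converting text into tokens (small units) that the model can process. Unlike a plain split on whitespace, modern tokenizers use subword tokenization, so a rare word is broken into smaller known pieces instead of being mapped to an unknown token.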
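To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, chosen only for illustration; a real BERT vocabulary has roughly 30,000 entries and the library implementation handles many more details.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces using greedy longest-match-first.

    Pieces that continue a word carry the "##" prefix, as in BERT vocabularies.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##s", "un", "##related"}

for word in ["tokenization", "tokens", "unrelated", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word that is in the vocabulary would come back as a single piece, while a word with no matching pieces falls back to `[UNK]`.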

Python: Tokenization with Hugging Face

# =============================================
# Tokenization with Hugging Face
# =============================================
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
# Indonesian sample sentence: "Transformers revolutionize NLP with the attention mechanism!"
text = "Transformers merevolusi NLP dengan attention mechanism!"
tokens = tokenizer(text)
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention Mask: {tokens['attention_mask']}")

# Decode back to text (the result includes the [CLS] and [SEP] special tokens)
decoded = tokenizer.decode(tokens['input_ids'])
print(f"Decoded: {decoded}")

# Inspect individual tokens
tokens_list = tokenizer.tokenize(text)
print(f"Tokens: {tokens_list}")
# The exact pieces depend on the model's vocabulary, so run this to see them.

# BERT uses WordPiece: a word found in the vocabulary stays one token,
# while a rare or unseen word is split into subword pieces.
# Illustration: "merevolusi" might become ["mer", "##evo", "##lusi"]

# Batch tokenization
texts = ["Hello world", "NLP is fascinating", "Learning AI"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
print(f"Batch shape: {batch['input_ids'].shape}")

# Tokenizer for GPT (BPE - Byte Pair Encoding)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokens = gpt_tokenizer.tokenize(text)
print(f"GPT tokens: {gpt_tokens}")

Types of Tokenization

Method	Used By	How It Works
WordPiece	BERT	Splits rare words into subword pieces
BPE (Byte Pair Encoding)	GPT, Llama	Repeatedly merges the most frequent adjacent symbol pairs
SentencePiece	T5, mBART	Language-agnostic; works on raw text without pre-tokenization
WordLevel	Older models	One word = one token
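The BPE row can be illustrated with a small training loop. The sketch below learns merges on a hypothetical four-word corpus (`toy_corpus`, with made-up frequencies); the helper names are illustrative, and production tokenizers such as GPT-2's operate on bytes and add many optimizations that are left out here.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a {symbol-tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merges, starting from single characters."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print("Learned merges:", merges)
print("Segmented corpus:", segmented)
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the vocabulary size.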

3. Attention Mechanism

Self-attention is the core mechanism of Transformers. Each token "looks at" every other token in the sequence and decides which ones are most relevant to its own context.

Diagram: Self-Attention

┌─────────────────────────────────────────────────────────────────┐
│                 SELF-ATTENTION MECHANISM                        │
│                                                                 │
│  Input: "The cat sat on the mat"                                │
│                                                                 │
│  For each word, compute:                                        │
│  1. Query (Q): "What am I looking for?"                         │
│  2. Key (K):   "What do I offer?"                               │
│  3. Value (V): "What information do I carry?"                   │
│                                                                 │
│  Attention(Q,K,V) = softmax(Q·K^T / √d_k) · V                   │
│                                                                 │
│  The word "sat" looks for:                                      │
│  → cat (who sat?) → high attention                              │
│  → mat (sat where?) → high attention                            │
│  → on, the → medium attention (function words)                  │
│                                                                 │
│  Multi-Head Attention:                                          │
│  → Run 8-12 attention heads in parallel                         │
│  → Each head focuses on a different pattern                     │
│  → Combine the results                                          │
└─────────────────────────────────────────────────────────────────┘
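The formula in the diagram can be traced by hand with a few lines of plain Python. This is a minimal single-head sketch, assuming tiny hand-written vectors (`X`) in place of learned Q, K, V projections; real models compute the same thing with batched matrix operations on a GPU.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One score per key: dot(q, k) scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Toy input: 3 tokens with 2-dimensional vectors (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` sums to 1, which is what lets each output vector be read as a weighted mix of the value vectors.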

Python: Visualizing Attention

# =============================================
# Visualizing Attention Weights
# =============================================
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# Take the attention weights from the last layer, first head
attention = outputs.attentions[-1][0, 0].numpy()
# Shape: (seq_len, seq_len)

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(attention, cmap="Blues")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticklabels(tokens)
ax.set_title("Self-Attention Weights (Layer 12, Head 1)")
plt.colorbar(im)
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=150)
print("Attention heatmap saved!")

# Multi-Head Attention
print(f"Number of layers: {len(outputs.attentions)}")  # 12
print(f"Number of heads per layer: {outputs.attentions[0].shape[1]}")  # 12
print(f"Sequence length: {outputs.attentions[0].shape[2]}")

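4. BERT Architecture

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model pre-trained with two objectives: Masked Language Modeling (MLM), where the model predicts tokens hidden behind [MASK], and Next Sentence Prediction (NSP), where it predicts whether one sentence follows another.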
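As a rough illustration of how MLM training inputs are built, the sketch below follows the recipe from the original BERT paper (not spelled out in this text): about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. The function name `mask_tokens` and the small `vocab` list are hypothetical.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for Masked Language Modeling.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Special tokens are never selected; other tokens are selected
        # with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            masked.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            masked.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(tok)                # 10%: keep the original token
    return masked, labels


# Hypothetical vocabulary and sentence, for illustration only.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

The loss is computed only at positions where `labels` is not None, so the model cannot score points by copying visible tokens.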

Diagram: BERT vs GPT Architecture

┌─────────────────────────────────────────────────────────────────┐
│  BERT (Encoder-only)          GPT (Decoder-only)               │
│                                                                  │
│  ┌───[CLS] tok1 tok2 [MASK]   tok1 tok2 tok3 → [NEXT]        │
│  │      ↓    ↓    ↓    ↓         ↓    ↓    ↓     ↓             │
│  │     ╔═══════════════╗        ╔═══════════════╗              │
│  │     ║   Encoder     ║        ║   Decoder     ║              │
│  │     ║  (Bidirect.)  ║        ║  (Causal)     ║              │
│  │     ║  12 layers    ║        ║  12+ layers   ║              │
│  │     ╚═══════════════╝        ╚═══════════════╝              │
│  │      ↓    ↓    ↓    ↓         ↓    ↓    ↓                   │
│  │   [output embeddings]       [next token pred]               │
│  │                                                             │
│  → For: classification,          → For: text generation,       │
│    NER, QA, similarity             chatbot, summarization      │
│    (understanding tasks)            (generation tasks)         │
└─────────────────────────────────────────────────────────────────┘
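The "(Bidirect.)" and "(Causal)" labels in the diagram come down to which positions each token is allowed to attend to. A minimal sketch, using hypothetical helper names, where 1 means "may attend" and 0 means "blocked":

```python
def bidirectional_mask(n):
    """Encoder-style mask: every token may attend to every token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


print("Bidirectional (BERT-style):")
for row in bidirectional_mask(4):
    print(row)

print("Causal (GPT-style):")
for row in causal_mask(4):
    print(row)
```

The lower-triangular causal mask is what makes next-token prediction a fair task: position i never sees the tokens it is asked to predict.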

Python: BERT for Various Tasks

# =============================================
# BERT for NLP Tasks
# =============================================
from transformers import pipeline

# ----- 1. Masked Language Model -----
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
results = fill_mask("Transformers are [MASK] for NLP tasks.")
for r in results:
    print(f"  {r['token_str']}: {r['score']:.3f}")
# Prints the most likely fill-ins with their probabilities
# (the exact tokens and scores depend on the model)

# ----- 2. Sentence Embeddings -----
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

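def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over all tokens (fine for a single unpadded sentence;
    # padded batches should weight by the attention mask)
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding[0]

# all-MiniLM-L6-v2 is trained mainly on English text, so English sentences are used here
emb1 = get_embedding("I enjoy learning about AI")
emb2 = get_embedding("AI is fascinating to study")
emb3 = get_embedding("A recipe for delicious fried rice")

# Cosine similarity (dim=0 because each embedding is a 1-D vector)
from torch.nn.functional import cosine_similarity
print(f"AI vs AI: {cosine_similarity(emb1, emb2, dim=0).item():.3f}")     # expected to be high
print(f"AI vs Food: {cosine_similarity(emb1, emb3, dim=0).item():.3f}")   # expected to be clearly lower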
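The mean-pooling step above averages every token vector, which is only safe when the input has no padding. The sketch below shows the mask-aware variant and the cosine formula in plain Python; the helper names and the toy vectors are hypothetical, and real code would do the same arithmetic on tensors.

```python
import math


def masked_mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose attention mask is 1 (real tokens)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two real tokens followed by one padding position (mask = 0).
vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(vectors, mask)
print(pooled)
print(cosine(pooled, [2.0, 3.0]))
```

Without the mask, the padding vector would drag the average toward zero and distort similarity scores for short sentences in a padded batch.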

5. GPT Architecture

GPT (Generative Pre-trained Transformer) is a decoder-only model trained to predict the next token given all previous tokens. GPT excels at text generation, conversation, and creative tasks.

Python: Text Generation with GPT

# =============================================
# Text Generation with GPT
# =============================================
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# ----- Using a pipeline -----
# GPT-2 was trained mostly on English text, so the prompt is in English
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Artificial Intelligence in Indonesia",
    max_length=100,
    num_return_sequences=2,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
for i, r in enumerate(result):
    print(f"Generated {i+1}: {r['generated_text']}")

# ----- GPT with AutoModelForCausalLM -----
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Manual generation
input_text = "Machine learning is"
inputs = tokenizer(input_text, return_tensors="pt")

output = model.generate(
    **inputs,  # passes input_ids together with the attention_mask
    max_length=150,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated padding token
)
generated = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated)

# ----- Using the OpenAI API -----
# Requires `pip install openai` and the OPENAI_API_KEY environment variable
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a technical writer."},
        {"role": "user", "content": "Explain Transformers in 100 words."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
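The `temperature`, `top_k`, and `top_p` arguments used in the generation calls above control how the next token is drawn from the model's output scores. The following is a simplified sketch of that idea in plain Python, assuming a hand-written list of logits; it is not the library's implementation, and the function names are illustrative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=0.95):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next(logits, temperature=0.7, top_k=50, top_p=0.95, rng=None):
    """Sample one token id from the filtered, renormalized distribution."""
    rng = rng or random.Random()
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


# Hypothetical logits for a 4-token vocabulary.
toy_logits = [5.0, 1.0, 0.5, -2.0]
print(filter_top_k_top_p(softmax(toy_logits, temperature=1.5), top_k=3, top_p=0.99))
print(sample_next(toy_logits, rng=random.Random(0)))
```

Lower temperature and smaller `top_k` or `top_p` make output more predictable; higher values make it more varied at the cost of coherence.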

6. Hugging Face Ecosystem

Hugging Face is often described as the "GitHub for AI": it is the main platform for sharing models, datasets, and ML demos. The Hugging Face Transformers library is a de facto industry standard for NLP.

Python: Hugging Face Essentials

# =============================================
# Hugging Face Ecosystem
# =============================================
# pip install transformers datasets evaluate huggingface_hub

# ----- 1. Pipeline (Paling Mudah) -----
from transformers import pipeline

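# Sentiment Analysis
# Note: "indobenchmark/indobert-base-p1" is a pre-trained base checkpoint. Unless it is
# fine-tuned on sentiment data first, the pipeline attaches a randomly initialized
# classification head, so the labels and scores are not meaningful yet.
classifier = pipeline("sentiment-analysis", model="indobenchmark/indobert-base-p1")
result = classifier("Film ini sangat bagus dan menarik!")  # "This film is very good and interesting!"
print(result)  # a list of the form [{'label': ..., 'score': ...}]

# Text Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
long_text = (
    "Transformers were introduced in 2017 in the paper 'Attention Is All You Need'. "
    "They rely on self-attention to process an entire sequence in parallel, which "
    "avoids the sequential bottleneck of RNNs and LSTMs. Models such as BERT and GPT "
    "build on this architecture and are shared through the Hugging Face Hub."
)
summary = summarizer(long_text, max_length=100, min_length=30)
print(summary[0]["summary_text"])

# Translation (Indonesian → English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = translator("Saya suka belajar pemrograman")
print(result)  # a list of the form [{'translation_text': '...'}]

# Named Entity Recognition
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Barack Obama was born in Hawaii in 1961")
print(entities)

# Question Answering (extractive: the answer is a span copied from the context)
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(question="Who founded Tesla?", context="Tesla was founded by Elon Musk...")
print(result)  # a dict with 'answer', 'score', 'start', and 'end' keys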

# ----- 2. Load a Dataset from the Hub -----
from datasets import load_dataset
dataset = load_dataset("imdb", split="train[:1000]")
print(dataset)
print(dataset[0])  # a dict with 'text' and 'label' keys (label is 0 or 1)

# ----- 3. Push a Model to the Hub -----
from huggingface_hub import login
login(token="hf_...")  # placeholder; avoid hard-coding a real token in source files

# Push a model and tokenizer that you have already loaded or fine-tuned
model.push_to_hub("username/my-indonesian-sentiment-model")
tokenizer.push_to_hub("username/my-indonesian-sentiment-model")

7. Fine-tuning for NLP Tasks

Python: Fine-tuning BERT for Sentiment Analysis

# =============================================
# Fine-tuning BERT for Sentiment Analysis
# =============================================
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import evaluate
import numpy as np

# Load dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate only; padding is applied per batch by the Trainer's data collator
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

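# Training
# Note: `eval_strategy` is the name used by recent transformers releases;
# older releases call the same option `evaluation_strategy`.
args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

trainer = Trainer(
    model=model, args=args,
    # Small shuffled subsets keep this example fast
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,  # also enables dynamic padding in the default data collator
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")

# Save the final model and tokenizer to the root of the output directory so that
# they can be reloaded later from "./bert-sentiment"
trainer.save_model("./bert-sentiment")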

8. Popular NLP Tasks

Task	Model	Pipeline Name
Text Classification	BERT, RoBERTa	`text-classification`
Named Entity Recognition	BERT-NER	`ner`
Question Answering	RoBERTa-SQuAD	`question-answering`
Summarization	BART, T5	`summarization`
Translation	Marian, NLLB	`translation`
Text Generation	GPT-2, Llama	`text-generation`
Fill Mask	BERT	`fill-mask`
Semantic Similarity	Sentence-BERT	`feature-extraction` (embeddings, then cosine similarity)

9. Deployment

Python: Deploying an NLP Model

# =============================================
# Deploying an NLP Model with FastAPI
# =============================================
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

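app = FastAPI()
# Assumes the fine-tuned model and tokenizer were saved to ./bert-sentiment
classifier = pipeline("sentiment-analysis", model="./bert-sentiment")

class TextInput(BaseModel):
    text: str

# Plain `def` endpoints run in FastAPI's thread pool, so the blocking
# model call does not stall the event loop.
@app.post("/predict")
def predict(payload: TextInput):
    result = classifier(payload.text)
    # The label is LABEL_0 / LABEL_1 unless id2label was set during fine-tuning
    return {"sentiment": result[0]["label"], "score": result[0]["score"]}

@app.post("/predict-batch")
def predict_batch(texts: list[str]):
    # The request body is a JSON array of strings
    results = classifier(texts)
    return {"results": results}

# Hugging Face Inference API (no deployment of your own required)
# POST https://api-inference.huggingface.co/models/{model_id}
# Headers: {"Authorization": "Bearer hf_..."}
# Body: {"inputs": "your text here"}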
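To call the `/predict` endpoint defined above, a client only needs to POST a JSON body with a `text` field. The sketch below uses the standard library and assumes the service is running locally at `http://localhost:8000` (a hypothetical address, for example when started with a development server); the helper names are illustrative.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:8000"):
    """Build a POST request whose JSON body matches the TextInput schema."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:8000"):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_predict_request(text, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network, so it is safe to inspect.
req = build_predict_request("This film is great!")
print(req.get_method(), req.full_url)
print(req.data)
```

Calling `predict("This film is great!")` would then return the dictionary produced by the endpoint, provided the server is up.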

10. Comprehension Quiz

Summary

📝 Key Points

Transformers: a revolutionary architecture built on self-attention and parallel processing
BERT: encoder-only, for understanding tasks (classification, NER, QA)
GPT: decoder-only, for generation tasks (chatbot, summarization)
Tokenization: subword methods (WordPiece, BPE) handle out-of-vocabulary (OOV) words
Attention: the Q, K, V mechanism that determines relevance between tokens
Hugging Face: the main ecosystem for loading, training, and deploying NLP models
Fine-tuning: transfer learning, where a pre-trained model plus task-specific data yields high performance
