1. Introduction to NLP & Transformers
Natural Language Processing (NLP) is the branch of AI that enables computers to understand, interpret, and generate human language. Before Transformers, NLP relied on RNNs and LSTMs, which are slow to train because they process tokens one at a time and struggle to capture long-range context.
Transformers (introduced in 2017 by Google researchers in the paper "Attention Is All You Need") revolutionized NLP with a self-attention mechanism that can process the entire sequence in parallel.
EVOLUTION OF NLP ARCHITECTURES
Era 1: Rule-based (1950-1990)
- Regex, grammar rules, dictionaries
Era 2: Statistical (1990-2013)
- Bag of Words, TF-IDF, Naive Bayes, SVM
Era 3: Deep Learning (2013-2017)
- Word2Vec, GloVe, RNN, LSTM, GRU
- Vanishing gradient, sequential processing
Era 4: Transformers (2017-present)
- Attention mechanism, parallel processing
- BERT (2018), GPT-2 (2019), GPT-3 (2020)
- GPT-4 (2023), Llama 3 (2024)
- State-of-the-art on almost all NLP tasks!
Why Do Transformers Dominate?
| Feature | RNN/LSTM | Transformers |
|---|---|---|
| Parallel processing | ❌ Sequential | ✅ Fully parallel |
| Long-range dependencies | ⚠️ Difficult | ✅ Easy (attention) |
| Training speed | Slow | Fast (with GPUs) |
| Pre-training | Uncommon | Transfer learning |
| Scalability | Limited | Highly scalable |
2. Tokenization
Tokenization is the process of converting text into tokens (small units) that a model can process. Unlike a simple whitespace split, modern tokenizers use subword tokenization.
# =============================================
# Tokenization with Hugging Face
# =============================================
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sentence
text = "Transformers merevolusi NLP dengan attention mechanism!"
tokens = tokenizer(text)
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention Mask: {tokens['attention_mask']}")

# Decode back to text (the result includes the [CLS] and [SEP] special tokens)
decoded = tokenizer.decode(tokens['input_ids'])
print(f"Decoded: {decoded}")

# Inspect the individual tokens
tokens_list = tokenizer.tokenize(text)
print(f"Tokens: {tokens_list}")
# Illustrative output (the exact pieces depend on the vocabulary):
# ['transform', '##ers', 'mer', '##evo', '##lusi', 'nlp', 'dengan',
#  'attention', 'mechanism', '!']
# BERT uses WordPiece: common words stay as one token, rare words are split
# "merevolusi" -> ["mer", "##evo", "##lusi"] (subword pieces)

# Batch tokenization (return_tensors="pt" requires PyTorch)
texts = ["Halo dunia", "NLP sangat menarik", "Belajar AI"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
print(f"Batch shape: {batch['input_ids'].shape}")

# Tokenizer for GPT (BPE - Byte Pair Encoding)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokens = gpt_tokenizer.tokenize(text)
print(f"GPT tokens: {gpt_tokens}")
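The batch call above passes padding=True and truncation=True. As a rough, library-free sketch of what those two options do to a batch of token IDs, consider the following. The helper name pad_batch, the pad ID 0, and the token IDs are assumptions made for illustration; this is not the Hugging Face implementation.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical token IDs for three sentences of different lengths.
batch_ids = [[101, 7, 8, 102], [101, 5, 102], [101, 9, 10, 11, 12, 102]]
ids, mask = pad_batch(batch_ids, max_length=5)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

The sketch simply cuts off the tail of an over-long sequence. A real tokenizer also keeps the closing special token in place when it truncates, so treat this only as a picture of how the attention mask lines up with the padding.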
Types of Tokenization
| Method | Used By | How It Works |
|---|---|---|
| WordPiece | BERT | Splits rare words into subwords |
| BPE (Byte Pair Encoding) | GPT, Llama | Iteratively merges the most frequent adjacent symbol pairs |
| SentencePiece | T5, mBART | Language-agnostic; works on raw text without pre-tokenization |
| WordLevel | Older models | One word = one token |
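To make the BPE row more concrete, here is a minimal sketch of how BPE merges could be learned on a toy corpus. The corpus, its word frequencies, and the function names count_pairs and merge_pair are invented for illustration. Production tokenizers add details such as byte-level handling and end-of-word markers that are left out here.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a {tuple_of_symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (assumed): word -> frequency, each word split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties resolve to the first pair counted
    words = merge_pair(words, best)
    merges.append(best)

print(merges)
print(list(words))
```

After three merges the frequent ending "est" has become a single symbol, which is the effect the table describes: common character sequences end up as one token, while rare words remain split into smaller pieces.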
3. Attention Mechanism
Self-attention is the core mechanism of Transformers. Each token "looks at" every other token in the sequence and decides which ones are most relevant to its context.
SELF-ATTENTION MECHANISM
Input: "Kucing duduk di atas tikar" ("The cat sits on the mat")
For each word, compute:
1. Query (Q): "What am I looking for?"
2. Key (K): "What do I offer?"
3. Value (V): "What information do I hold?"
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
The word "duduk" (sits) looks for:
- Kucing (who is sitting?) -> high attention
- tikar (sitting where?) -> high attention
- di, atas -> medium attention (prepositions)
Multi-Head Attention:
- Run 8-12 attention computations in parallel (heads)
- Each head focuses on a different pattern
- Combine the results
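The formula Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V can be sketched in a few lines of plain Python. The vectors below are made-up 2-dimensional values chosen only to show the mechanics. In a real model, Q, K, and V come from learned linear projections of the token embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Returns (outputs, weights), where weights[i][j] is how much
    token i attends to token j.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[col] for w, v in zip(row, V)) for col in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights


# Made-up vectors for three tokens (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors. Multi-head attention repeats this computation with several independent sets of projections and concatenates the results.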
# =============================================
# Visualizing Attention Weights
# =============================================
import matplotlib.pyplot as plt
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# Take the attention from the last layer, first head
attention = outputs.attentions[-1][0, 0].numpy()
# Shape: (seq_len, seq_len)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(attention, cmap="Blues")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticklabels(tokens)
ax.set_title("Self-Attention Weights (Layer 12, Head 1)")
plt.colorbar(im)
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=150)
print("Attention heatmap saved!")

# Multi-Head Attention
print(f"Number of layers: {len(outputs.attentions)}")  # 12
print(f"Heads per layer: {outputs.attentions[0].shape[1]}")  # 12
print(f"Sequence length: {outputs.attentions[0].shape[2]}")
4. BERT Architecture
BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
| | BERT (Encoder-only) | GPT (Decoder-only) |
|---|---|---|
| Input | [CLS] tok1 tok2 [MASK] | tok1 tok2 tok3 -> [NEXT] |
| Stack | Encoder (bidirectional), 12 layers | Decoder (causal), 12+ layers |
| Output | Output embeddings | Next-token prediction |
| Used for | Classification, NER, QA, similarity (understanding tasks) | Text generation, chatbots, summarization (generation tasks) |
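The difference between a bidirectional encoder and a causal decoder comes down to which positions each token is allowed to attend to. A minimal sketch of the two attention masks follows (1 = may attend, 0 = blocked). The function names are chosen for illustration and do not come from any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["tok1", "tok2", "tok3", "tok4"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok}: {row}")
```

With the causal mask, tok2 never sees tok3 or tok4, which is what lets a decoder-only model be trained to predict the next token. The bidirectional mask lets BERT use context on both sides of a [MASK] position.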
# =============================================
# BERT for NLP Tasks
# =============================================
from transformers import pipeline

# ----- 1. Masked Language Model -----
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
results = fill_mask("Transformers are [MASK] for NLP tasks.")
for r in results:
    print(f"  {r['token_str']}: {r['score']:.3f}")
# Illustrative output format only; actual tokens and scores will differ:
# "good": 0.234, "used": 0.189, "great": 0.087, ...

# ----- 2. Sentence Embeddings -----
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over the token dimension (single sentence, so no padding)
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding[0]  # 1-D vector of size hidden_dim

emb1 = get_embedding("Saya suka belajar AI")
emb2 = get_embedding("AI sangat menarik untuk dipelajari")
emb3 = get_embedding("Resep nasi goreng enak")

# Cosine similarity (dim=0 because the embeddings are 1-D vectors)
from torch.nn.functional import cosine_similarity
sim_related = cosine_similarity(emb1, emb2, dim=0).item()
sim_unrelated = cosine_similarity(emb1, emb3, dim=0).item()
print(f"AI vs AI: {sim_related:.3f}")      # related topics: expect the higher score
print(f"AI vs Food: {sim_unrelated:.3f}")  # unrelated topics: expect the lower score
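The two operations used above, mean pooling and cosine similarity, are simple enough to write out by hand. This sketch uses made-up 2-dimensional token vectors and also shows why an attention mask matters once sentences are padded in a batch. The helper names mean_pool and cosine are assumptions for illustration.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors; the last position is padding and must be ignored.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vec = mean_pool(token_vectors, attention_mask)
print(sentence_vec)
print(round(cosine(sentence_vec, [1.0, 1.0]), 3))
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))
```

A plain mean over the token dimension is fine for a single unpadded sentence, as in get_embedding. For padded batches, pooling without the mask would mix padding vectors into the sentence embedding.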
5. GPT Architecture
GPT (Generative Pre-trained Transformer) is a decoder-only model trained to predict the next token. GPT excels at text generation, conversation, and creative tasks.
# =============================================
# Text Generation with GPT
# =============================================
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# ----- Using a pipeline -----
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Artificial Intelligence di Indonesia",
    max_length=100,
    num_return_sequences=2,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
for i, r in enumerate(result):
    print(f"Generated {i+1}: {r['generated_text']}")

# ----- GPT with Hugging Face -----
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Manual generation
input_text = "Machine learning adalah"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(
    input_ids,
    max_length=150,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS
)
generated = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated)

# ----- Using the OpenAI API -----
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Anda adalah penulis teknis."},
        {"role": "user", "content": "Jelaskan Transformers dalam 100 kata."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
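The generation calls above all pass temperature, top_k, and top_p. As a simplified sketch of how these three parameters could shape the choice of the next token, consider the following. The logits are hypothetical, the function names are invented, and the library's actual implementation differs in detail.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches top_p; renormalize what is left."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx] / k_total
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Sample one token index from the filtered distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(candidates)
    return rng.choices(indices, weights=[candidates[i] for i in indices], k=1)[0]


# Hypothetical logits over a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(filter_top_k_top_p(softmax(logits, 0.7), top_k=3, top_p=0.9))
print(sample_next(logits, rng=random.Random(0)))
```

Lower temperature and smaller top_k or top_p all concentrate probability on the most likely tokens, giving more predictable text. Larger values allow more variety at the cost of coherence.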
6. Hugging Face Ecosystem
Hugging Face is the "GitHub for AI": the main platform for sharing ML models, datasets, and demos. The Hugging Face Transformers library is a de facto industry standard for NLP.
# =============================================
# Hugging Face Ecosystem
# =============================================
# pip install transformers datasets evaluate huggingface_hub

# ----- 1. Pipeline (the easiest way) -----
from transformers import pipeline

# Sentiment Analysis
# Note: indobert-base-p1 is a pre-trained base checkpoint. Its classification
# head is randomly initialized until fine-tuned, so swap in a checkpoint
# fine-tuned for sentiment to get meaningful labels.
classifier = pipeline("sentiment-analysis", model="indobenchmark/indobert-base-p1")
result = classifier("Film ini sangat bagus dan menarik!")
print(result)  # output shape: [{'label': ..., 'score': ...}]

# Text Summarization
long_text = "<paste the article to summarize here>"  # placeholder input
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=100, min_length=30)
print(summary)

# Translation (Indonesian -> English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = translator("Saya suka belajar pemrograman")
print(result)  # e.g. [{'translation_text': 'I like learning programming'}]

# Named Entity Recognition
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Barack Obama lahir di Hawaii pada tahun 1961")
print(entities)

# Question Answering
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(question="Siapa pendiri Tesla?", context="Tesla didirikan oleh Elon Musk...")
print(result)  # e.g. {'answer': 'Elon Musk', 'score': ..., ...}

# ----- 2. Load a Dataset from the Hub -----
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:1000]")
print(dataset)
print(dataset[0])  # {'text': '...', 'label': ...}

# ----- 3. Push a Model to the Hub -----
from huggingface_hub import login

login(token="hf_...")

# Push the model and tokenizer
# (assumes `model` and `tokenizer` hold your own fine-tuned model and its tokenizer)
model.push_to_hub("username/my-indonesian-sentiment-model")
tokenizer.push_to_hub("username/my-indonesian-sentiment-model")
7. Fine-tuning for NLP Tasks
# =============================================
# Fine-tuning BERT for Sentiment Analysis
# =============================================
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import evaluate
import numpy as np

# Load the dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate only; the Trainer pads each batch dynamically when given a tokenizer
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Metrics
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

# Training
args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,  # newer transformers releases call this processing_class
    compute_metrics=compute_metrics
)
trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")

# Save the final model and tokenizer so they can be loaded from this directory
trainer.save_model("./bert-sentiment")
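The compute_metrics function above turns raw logits into predictions with an argmax and then compares them to the labels. The same calculation, written out without NumPy or the evaluate library and using made-up logits, might look like this:

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits for four reviews with two classes (0 = negative, 1 = positive).
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

Three of the four predictions match their labels. The third review is predicted positive but labeled negative, hence 0.75.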
8. Popular NLP Tasks
| Task | Model | Pipeline Name |
|---|---|---|
| Text Classification | BERT, RoBERTa | text-classification |
| Named Entity Recognition | BERT-NER | ner |
| Question Answering | RoBERTa-SQuAD | question-answering |
| Summarization | BART, T5 | summarization |
| Translation | Marian, NLLB | translation |
| Text Generation | GPT-2, Llama | text-generation |
| Fill Mask | BERT | fill-mask |
| Semantic Similarity | Sentence-BERT | feature-extraction (embeddings compared with cosine similarity; "sentence-similarity" is the Hub task tag) |
9. Deployment
# =============================================
# Deploying an NLP Model with FastAPI
# =============================================
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Assumes the fine-tuned model and tokenizer were saved to ./bert-sentiment
classifier = pipeline("sentiment-analysis", model="./bert-sentiment")

class TextInput(BaseModel):
    text: str

# Plain `def` endpoints: FastAPI runs them in a thread pool, so the blocking
# model call does not stall the event loop
@app.post("/predict")
def predict(payload: TextInput):
    result = classifier(payload.text)
    return {"sentiment": result[0]["label"], "score": result[0]["score"]}

# Expects a JSON array of strings as the request body
@app.post("/predict-batch")
def predict_batch(texts: list[str]):
    results = classifier(texts)
    return {"results": results}

# Hugging Face Inference API (no deployment of your own needed)
# POST https://api-inference.huggingface.co/models/{model_id}
# Headers: {"Authorization": "Bearer hf_..."}
# Body: {"inputs": "your text here"}
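The Inference API call described in the comments above can be assembled with the standard library alone. This is a sketch that builds the request without sending it. The model ID "username/my-model" and the token "hf_..." are placeholders, and the helper name build_inference_request is invented for illustration.

```python
import json
import urllib.request


def build_inference_request(model_id, text, token):
    """Build (but do not send) a POST request for the Inference API endpoint."""
    url = f"https://api-inference.huggingface.co/models/{model_id}"
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute a real model ID and your own access token.
request = build_inference_request("username/my-model", "teks Anda di sini", "hf_...")
print(request.full_url)
print(request.get_method())

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

The same pattern works for the FastAPI service: point the URL at the /predict endpoint and send {"text": "..."} as the JSON body instead.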
10. Comprehension Quiz
1. What is the main difference between BERT and GPT?
2. What is self-attention in Transformers?
3. Why is subword tokenization better than word-level tokenization?
4. What is the role of the Hugging Face Transformers library?
5. What are the advantages of pre-training + fine-tuning over training from scratch?
Summary
- Transformers: a revolutionary architecture built on self-attention and parallel processing
- BERT: encoder-only, for understanding tasks (classification, NER, QA)
- GPT: decoder-only, for generation tasks (chatbots, summarization)
- Tokenization: subword methods (WordPiece, BPE) that handle out-of-vocabulary (OOV) words
- Attention: the Q, K, V mechanism that determines relevance between tokens
- Hugging Face: the main ecosystem for loading, training, and deploying NLP models
- Fine-tuning: transfer learning, where a pre-trained model plus task-specific data yields high performance