NLP with Transformers: BERT, GPT, Attention & Hugging Face

A complete tutorial on modern Natural Language Processing with the Transformer architecture: from attention, tokenization, and the BERT and GPT architectures to fine-tuning and deployment with Hugging Face.

1. Introduction to NLP & Transformers

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, interpret, and generate human language. Before Transformers, NLP relied mainly on RNNs and LSTMs, which process tokens one at a time; this makes training slow and long-range context hard to capture.

The Transformer architecture, introduced in 2017 by Google researchers in the paper "Attention Is All You Need", revolutionized NLP with self-attention, a mechanism that processes the entire sequence in parallel.

Diagram: Evolution of NLP Architectures

  Era 1: Rule-based (1950-1990)
  → Regex, grammar rules, dictionaries

  Era 2: Statistical (1990-2013)
  → Bag of Words, TF-IDF, Naive Bayes, SVM

  Era 3: Deep Learning (2013-2017)
  → Word2Vec, GloVe, RNN, LSTM, GRU
  ❌ Vanishing gradients, sequential processing

  Era 4: Transformers (2017-present)
  → Attention mechanism, parallel processing
  → BERT (2018), GPT-2 (2019), GPT-3 (2020)
  → GPT-4 (2023), Llama 3 (2024)
  ✅ State of the art on nearly all NLP tasks

Why Do Transformers Dominate?

| Feature | RNN/LSTM | Transformers |
| --- | --- | --- |
| Parallel processing | ❌ Sequential | ✅ Fully parallel |
| Long-range dependencies | ⚠️ Difficult | ✅ Easy (attention) |
| Training speed | Slow | Fast (with GPUs) |
| Pre-training | Uncommon | Standard (transfer learning) |
| Scalability | Limited | Highly scalable |

2. Tokenization

Tokenization converts text into tokens, the small units a model actually operates on. Rather than simply splitting on whitespace, modern tokenizers use subword tokenization: common words stay whole, while rare words are broken into smaller known pieces.

Python: Tokenization with Hugging Face
# =============================================
# Tokenization with Hugging Face
# =============================================
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a text (an Indonesian sentence, mostly unfamiliar to this English vocabulary)
text = "Transformers merevolusi NLP dengan attention mechanism!"
tokens = tokenizer(text)
print(f"Input IDs: {tokens['input_ids']}")
print(f"Attention Mask: {tokens['attention_mask']}")

# Decode back to text (the result includes the special tokens [CLS] and [SEP])
decoded = tokenizer.decode(tokens['input_ids'])
print(f"Decoded: {decoded}")

# Inspect the individual tokens
tokens_list = tokenizer.tokenize(text)
print(f"Tokens: {tokens_list}")
# Continuation pieces are prefixed with "##". The exact split depends on the
# vocabulary, so run the code rather than relying on a hard-coded listing.

# BERT uses WordPiece: a common word is 1 token, a rare word is split
# e.g. "merevolusi" → something like ["mer", "##evo", "##lusi"] (subword pieces)

# Batch tokenization: pad to the longest text and return PyTorch tensors
texts = ["Halo dunia", "NLP sangat menarik", "Belajar AI"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
print(f"Batch shape: {batch['input_ids'].shape}")

# Tokenizer for GPT (BPE - Byte Pair Encoding)
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_tokens = gpt_tokenizer.tokenize(text)
print(f"GPT tokens: {gpt_tokens}")
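
To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the lookup strategy that WordPiece-style tokenizers apply to each word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and exist only for illustration; a real BERT vocabulary has roughly 30,000 entries, and the library also handles normalization and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces using greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ization", "##ize", "play", "##ing"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because every word can fall back to smaller pieces, far fewer words end up as `[UNK]` than with a word-level vocabulary.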

Types of Tokenization

| Method | Used by | How it works |
| --- | --- | --- |
| WordPiece | BERT | Splits rare words into subword pieces |
| BPE (Byte Pair Encoding) | GPT, Llama | Repeatedly merges the most frequent adjacent symbol pairs |
| SentencePiece | T5, mBART | Language-agnostic; works on raw text without pre-tokenization |
| WordLevel | Older models | One word = one token |
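
The BPE row can be illustrated with a small training loop. This is a sketch of the basic merge procedure on a toy word-frequency table, not the byte-level implementation GPT-2 actually ships; the helper names `most_frequent_pair` and `merge_pair` and the corpus counts are made up for the example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the pair seen first (dicts keep insertion order).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: each word starts as a tuple of characters with a count
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merge rules
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(words)
```

Each learned merge becomes a vocabulary entry, so frequent fragments such as "est" end up as single tokens while rare words remain sequences of smaller pieces.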

3. Attention Mechanism

Self-attention is the core mechanism of Transformers. Each token "looks at" every other token in the sequence and weighs how relevant each one is to its own context.
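
Concretely, scaled dot-product attention computes softmax(Q·Kᵀ / √d_k)·V. The sketch below implements that formula in plain Python on tiny hand-written matrices so each step is visible. The values of `Q`, `K`, and `V` are arbitrary illustrations, and a real model derives them from learned projections of the token embeddings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Arbitrary toy inputs: 2 tokens, dimension 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Every row of `weights` sums to 1, so each output vector is a convex combination of the value vectors; multi-head attention repeats this computation with several independent projections and concatenates the results.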

Diagram: Self-Attention

  SELF-ATTENTION MECHANISM

  Input: "The cat sat on the mat"

  For every token, compute:
  1. Query (Q): "What am I looking for?"
  2. Key (K):   "What do I offer?"
  3. Value (V): "What information do I carry?"

  Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

  The word "sat" looks for:
  → cat (who sat?) → high attention
  → mat (sat where?) → high attention
  → on, the → medium attention (function words)

  Multi-Head Attention:
  → Run several attention heads in parallel (8 in the original Transformer, 12 in BERT-base)
  → Each head focuses on a different pattern
  → Concatenate the results

Python: Visualizing Attention
# =============================================
# Visualizing Attention Weights
# =============================================
from transformers import AutoTokenizer, AutoModel
import matplotlib.pyplot as plt
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# Take the attention from the last layer, first head
# outputs.attentions is a tuple with one (batch, heads, seq_len, seq_len) tensor per layer
attention = outputs.attentions[-1][0, 0].numpy()
# Shape: (seq_len, seq_len)

# Plot the weights as a heatmap: rows are query tokens, columns are key tokens
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(attention, cmap="Blues")
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticklabels(tokens)
ax.set_title("Self-Attention Weights (Layer 12, Head 1)")
plt.colorbar(im)
plt.tight_layout()
plt.savefig("attention_heatmap.png", dpi=150)
print("Attention heatmap saved!")

# Multi-Head Attention
print(f"Number of layers: {len(outputs.attentions)}")  # 12
print(f"Number of heads per layer: {outputs.attentions[0].shape[1]}")  # 12
print(f"Sequence length: {outputs.attentions[0].shape[2]}")

4. BERT Architecture

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
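
The MLM objective can be sketched as a data-corruption step. The version below follows the recipe described in the BERT paper: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens`, the `replacement_vocab` list, and the `seed` parameter are illustrative assumptions, not part of any library.

```python
import random


def mask_tokens(tokens, replacement_vocab, mask_prob=0.15, seed=None):
    """Return (corrupted inputs, labels); a label is None where no prediction is required."""
    rng = random.Random(seed)
    special = {"[CLS]", "[SEP]"}  # special tokens are never selected
    inputs, labels = [], []
    for tok in tokens:
        if tok in special or rng.random() >= mask_prob:
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs.append("[MASK]")
        elif roll < 0.9:
            inputs.append(rng.choice(replacement_vocab))
        else:
            inputs.append(tok)  # kept as-is, but still predicted
    return inputs, labels


sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
replacement_vocab = ["dog", "tree", "ran", "blue"]  # hypothetical toy vocabulary

inputs, labels = mask_tokens(sentence, replacement_vocab, mask_prob=0.5, seed=0)
print(inputs)
print(labels)
```

The loss is computed only at positions whose label is not `None`, which is why the model can use context from both directions without trivially copying its input.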

Diagram: BERT vs GPT Architecture

  BERT (encoder-only)               GPT (decoder-only)

  [CLS] tok1 tok2 [MASK]            tok1 tok2 tok3 → [NEXT]
    ↓    ↓    ↓     ↓                 ↓    ↓    ↓
  ╔══════════════════╗              ╔══════════════════╗
  ║     Encoder      ║              ║     Decoder      ║
  ║ (bidirectional)  ║              ║     (causal)     ║
  ║    12 layers     ║              ║    12+ layers    ║
  ╚══════════════════╝              ╚══════════════════╝
    ↓    ↓    ↓     ↓                 ↓    ↓    ↓
  [output embeddings]               [next-token prediction]

  → For: classification, NER,       → For: text generation,
    QA, similarity                    chatbots, summarization
    (understanding tasks)             (generation tasks)

Python: BERT for Various Tasks
# =============================================
# BERT for NLP Tasks
# =============================================
from transformers import pipeline

# ----- 1. Masked Language Model -----
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
results = fill_mask("Transformers are [MASK] for NLP tasks.")
for r in results:
    print(f"  {r['token_str']}: {r['score']:.3f}")
# Prints the top candidate tokens for [MASK] with their probabilities

# ----- 2. Sentence Embeddings -----
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over tokens (one unpadded sentence, so no mask is needed)
    embedding = outputs.last_hidden_state.mean(dim=1)
    return embedding[0]  # 1-D vector of size hidden_dim

emb1 = get_embedding("I enjoy learning about AI")
emb2 = get_embedding("AI is a fascinating subject to study")
emb3 = get_embedding("A delicious fried rice recipe")

# Cosine similarity (dim=0 because the embeddings are 1-D vectors;
# the default dim=1 raises an error for 1-D inputs)
from torch.nn.functional import cosine_similarity
print(f"AI vs AI: {cosine_similarity(emb1, emb2, dim=0).item():.3f}")    # expected: the higher score
print(f"AI vs Food: {cosine_similarity(emb1, emb3, dim=0).item():.3f}")  # expected: the lower score
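
Averaging every token vector works for a single unpadded sentence, but in a padded batch the padding positions would dilute the average. A common remedy is to pool only over positions where the attention mask is 1. The sketch below shows that idea, together with cosine similarity, in plain Python; `mean_pool` and `cosine` are hypothetical helper names and the vectors are toy values.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / max(count, 1) for t in total]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy example: the third position is padding and must not affect the result
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
print(round(cosine([1.0, 2.0], [2.0, 4.0]), 3))  # 1.0 (same direction)
print(round(cosine([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0 (orthogonal)
```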

5. GPT Architecture

GPT (Generative Pre-trained Transformer) is a decoder-only model trained to predict the next token. It excels at text generation, conversation, and creative tasks.
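
Two ideas make next-token training work: a causal mask, so position i can attend only to positions up to i, and targets that are simply the input shifted by one token. The plain-Python sketch below illustrates both; the helper names `causal_mask` and `next_token_pairs` are made up for this example.

```python
def causal_mask(n):
    """Lower-triangular mask: row i may attend to columns 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def next_token_pairs(tokens):
    """Every prefix of the sequence paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]

for context, target in next_token_pairs(["machine", "learning", "is", "fun"]):
    print(context, "->", target)
```

Because of the mask, all of these prefix predictions can be trained in a single forward pass over the full sequence, whereas generation at inference time proceeds one token at a time.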

Python: Text Generation with GPT
# =============================================
# Text Generation with GPT
# =============================================
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# ----- Using a pipeline -----
# GPT-2 was trained mostly on English text, so English prompts work best
generator = pipeline("text-generation", model="gpt2")
result = generator(
    "Artificial Intelligence in Indonesia",
    max_length=100,
    num_return_sequences=2,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
for i, r in enumerate(result):
    print(f"Generated {i+1}: {r['generated_text']}")

# ----- Loading the model and tokenizer directly -----
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Manual generation
input_text = "Machine learning is"
# Calling the tokenizer returns input_ids together with the attention_mask
inputs = tokenizer(input_text, return_tensors="pt")

output = model.generate(
    **inputs,
    max_length=150,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated padding token
)
generated = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated)

# ----- Using the OpenAI API -----
# The client reads the API key from the OPENAI_API_KEY environment variable
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a technical writer."},
        {"role": "user", "content": "Explain Transformers in 100 words."}
    ],
    temperature=0.7,
    max_tokens=200
)
print(response.choices[0].message.content)
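
The `temperature`, `top_k`, and `top_p` arguments in the generation calls above control how the next token is drawn from the model's output distribution. The following is a simplified sketch of those three filters in plain Python. The function `sample_next` is hypothetical, and library implementations may apply the filters in a different order or renormalize between steps.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token index after temperature, top-k, and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the surviving tokens in proportion to their probabilities
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # arbitrary scores for a 3-token vocabulary
rng = random.Random(0)
print([sample_next(toy_logits, temperature=0.7, top_k=50, top_p=0.95, rng=rng)
       for _ in range(10)])
print(sample_next(toy_logits, top_k=1, rng=rng))  # 0: top_k=1 is greedy decoding
```

Lower temperature and smaller `top_k` or `top_p` make the output more predictable; higher values increase diversity at the cost of coherence.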

6. Hugging Face Ecosystem

Hugging Face is often described as the "GitHub of AI": the main platform for sharing models, datasets, and ML demos. Its Transformers library has become a de facto industry standard for NLP.

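Python: Hugging Face Essentials
# =============================================
# Hugging Face Ecosystem
# =============================================
# pip install transformers datasets evaluate huggingface_hub

# ----- 1. Pipeline (the easiest entry point) -----
from transformers import pipeline

# Sentiment Analysis
# Caution: indobert-base-p1 appears to be a base checkpoint. Without a
# fine-tuned classification head its labels and scores are not meaningful,
# so swap in a checkpoint fine-tuned for sentiment before trusting the output.
classifier = pipeline("sentiment-analysis", model="indobenchmark/indobert-base-p1")
result = classifier("Film ini sangat bagus dan menarik!")
print(result)  # a list of {'label': ..., 'score': ...} dicts

# Text Summarization
# Sample input for illustration; replace with your own long document
long_text = (
    "Transformers were introduced in 2017 in the paper 'Attention Is All You Need'. "
    "They rely on self-attention to process an entire sequence in parallel, which "
    "makes training faster than with recurrent networks and helps capture "
    "long-range context. Models such as BERT and GPT are built on this architecture."
)
summary = summarizer = pipeline("summarization", model="facebook/bart-large-cnn")(
    long_text, max_length=100, min_length=30
)
print(summary)

# Translation (Indonesian → English)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-id-en")
result = translator("Saya suka belajar pemrograman")
print(result)  # e.g. [{'translation_text': 'I like learning programming'}]

# Named Entity Recognition (this model is English, so the input is English)
# aggregation_strategy replaces the deprecated grouped_entities flag
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
entities = ner("Barack Obama was born in Hawaii in 1961")
print(entities)

# Question Answering (English model, English input)
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = qa(
    question="When was the Transformer architecture introduced?",
    context="The Transformer architecture was introduced in 2017 by Google researchers.",
)
print(result)  # a dict with 'answer', 'score', 'start', and 'end'

# ----- 2. Load a Dataset from the Hub -----
from datasets import load_dataset
dataset = load_dataset("imdb", split="train[:1000]")
print(dataset)
print(dataset[0])  # {'text': '...', 'label': ...}

# ----- 3. Push a Model to the Hub -----
from huggingface_hub import login
login(token="hf_...")

# Push the model and tokenizer
# Assumes `model` and `tokenizer` are your fine-tuned objects (see section 7)
model.push_to_hub("username/my-indonesian-sentiment-model")
tokenizer.push_to_hub("username/my-indonesian-sentiment-model")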

7. Fine-tuning for NLP Tasks

Python: Fine-tuning BERT for Sentiment Analysis
# =============================================
# Fine-tuning BERT for Sentiment Analysis
# =============================================
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from datasets import load_dataset
import evaluate
import numpy as np

# Load dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

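# Training
args = TrainingArguments(
    output_dir="./bert-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none"
)

# Train and evaluate on small shuffled subsets to keep the run short
trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"Accuracy: {results['eval_accuracy']:.4f}")

# Save the final model and tokenizer so they can be loaded from ./bert-sentiment
trainer.save_model("./bert-sentiment")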

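8. Popular NLP Tasks

| Task | Model | Pipeline name |
| --- | --- | --- |
| Text Classification | BERT, RoBERTa | text-classification |
| Named Entity Recognition | BERT-NER | ner |
| Question Answering | RoBERTa-SQuAD | question-answering |
| Summarization | BART, T5 | summarization |
| Translation | Marian, NLLB | translation |
| Text Generation | GPT-2, Llama | text-generation |
| Fill Mask | BERT | fill-mask |
| Semantic Similarity | Sentence-BERT | feature-extraction (sentence-similarity is a Hub task tag rather than a transformers pipeline name) |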

9. Deployment

Python: Deploy an NLP Model
# =============================================
# Deploy an NLP Model with FastAPI
# =============================================
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Assumes a fine-tuned model and tokenizer were saved to ./bert-sentiment.
# Labels appear as LABEL_0 / LABEL_1 unless id2label was set during fine-tuning.
classifier = pipeline("sentiment-analysis", model="./bert-sentiment")

class TextInput(BaseModel):
    text: str

# Plain `def` endpoints run in FastAPI's thread pool, so the blocking
# model call does not stall the event loop
@app.post("/predict")
def predict(payload: TextInput):
    result = classifier(payload.text)
    return {"sentiment": result[0]["label"], "score": result[0]["score"]}

# Expects a JSON array of strings as the request body
@app.post("/predict-batch")
def predict_batch(texts: list[str]):
    results = classifier(texts)
    return {"results": results}

# Hugging Face Inference API (no deployment of your own)
# POST https://api-inference.huggingface.co/models/{model_id}
# Headers: {"Authorization": "Bearer hf_..."}
# Body: {"inputs": "your text here"}
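
The request described in the comments above can be assembled with the standard library alone. This sketch only builds the request object and does not send it; the URL pattern is copied from the comment and may differ from the current service endpoint, `build_inference_request` is a hypothetical helper, and `"hf_..."` is a placeholder for a real access token.

```python
import json
import urllib.request

# URL pattern taken from the comment above (assumption: still valid for your account)
API_URL = "https://api-inference.huggingface.co/models/{model_id}"


def build_inference_request(model_id, text, token):
    """Build (but do not send) a POST request carrying {"inputs": text}."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL.format(model_id=model_id),
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_inference_request("bert-base-uncased", "your text here", "hf_...")
print(req.get_method(), req.full_url)

# Sending is left out on purpose. With a real token it would look like:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Reading the token from an environment variable instead of hard-coding it keeps credentials out of source control.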

10. Comprehension Quiz

1. What is the main difference between BERT and GPT?

2. What is self-attention in Transformers?

3. Why is subword tokenization better than word-level tokenization?

4. What is the purpose of the Hugging Face Transformers library?

5. What is the advantage of pre-training plus fine-tuning over training from scratch?

Summary

Key Points
  • Transformers: a revolutionary architecture built on self-attention and parallel processing
  • BERT: encoder-only, for understanding tasks (classification, NER, QA)
  • GPT: decoder-only, for generation tasks (chatbots, summarization)
  • Tokenization: subword methods (WordPiece, BPE) handle out-of-vocabulary (OOV) words
  • Attention: the Q, K, V mechanism that determines relevance between tokens
  • Hugging Face: the main ecosystem for loading, training, and deploying NLP models
  • Fine-tuning: transfer learning, where a pre-trained model plus task-specific data yields high performance