1. Introduction to NLP Text Classification
Text classification is one of the most fundamental tasks in Natural Language Processing (NLP): assigning a label or category to a piece of text based on its content. The task is everywhere, from email spam filters and sentiment analysis of product reviews to content moderation on social media.
Applications of Text Classification
| Application | Type | Example |
|---|---|---|
| Sentiment Analysis | Binary/Multi-class | Review: positive/negative/neutral |
| Spam Detection | Binary | Email: spam/not spam |
| Topic Classification | Multi-class | News: sports/politics/technology |
| Intent Detection | Multi-class | Chatbot: buy/check_status/complain |
| Language Detection | Multi-class | ID/EN/MS/ZH |
| Emotion Detection | Multi-class | Happy/sad/angry/afraid |
| Hate Speech Detection | Binary | Content: offensive/safe |
| Spam Review Detection | Binary | Review: fake/genuine |
Text Classification Pipeline
Raw text ("Produk ini bagus!") → Text preprocessing → Tokenization → Vectorization / Embedding → Model → Output label (Positive)
Traditional pipeline:
1. Lowercase, remove noise
2. Word tokenization
3. Bag-of-Words / TF-IDF
4. Naive Bayes / SVM
Modern pipeline:
1. Subword tokenization (BPE/WordPiece)
2. Pre-trained embeddings (BERT)
3. Fine-tune a Transformer model
2. Text Preprocessing
Text preprocessing cleans and normalizes text before it is fed to a model. For traditional models in particular, the quality of preprocessing strongly affects the final result.
Preprocessing Steps
| Step | Description | Example |
|---|---|---|
| Lowercasing | Convert all letters to lowercase | "Bagus!" → "bagus!" |
| Remove HTML tags | Strip HTML markup | "<p>teks</p>" → "teks" |
| Remove URLs | Drop links | "kunjungi http://x.com" → "kunjungi" |
| Remove special characters | Drop symbols and selected digits | "good!!! #nice" → "good nice" |
| Remove stopwords | Drop common words (di, ke, yang, dan) | "saya suka makan nasi" → "suka makan nasi" |
| Stemming | Strip affixes to get the root word | "bermain" → "main", "kebersihan" → "bersih" |
| Lemmatization | Map to the dictionary base form (more accurate than stemming) | "running" → "run", "better" → "good" |
| Remove emoji | Drop emoji from the text | "bagus 👍😊" → "bagus" |
Modern Transformer models such as BERT and GPT do NOT need aggressive preprocessing. Stopwords, punctuation, and even capitalization carry useful information. For BERT/GPT, subword tokenization (WordPiece/BPE) is enough; skip stemming and stopword removal.
import re
# =============================================
# TEXT PREPROCESSING FUNCTIONS
# =============================================
def preprocess_text(text, remove_stopwords=False, do_stemming=False):
    """Preprocessing pipeline for Indonesian text."""
    # Note: do_stemming is currently unused; no stemming step is implemented here.
    # 1. Lowercase
    text = text.lower()
    # 2. Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # 3. Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # 4. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # 5. Remove emoji and special characters (keep only word characters and whitespace)
    text = re.sub(r'[^\w\s]', ' ', text)
    # 6. Remove digits
    text = re.sub(r'\d+', '', text)
    # 7. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # 8. Remove stopwords (optional)
    if remove_stopwords:
        # Indonesian stopwords (small illustrative list).
        # Caution: it includes negations such as 'tidak', which matter for sentiment.
        stop_words = {
            'yang', 'di', 'dan', 'ini', 'itu', 'dengan', 'untuk',
            'pada', 'ke', 'dari', 'ada', 'adalah', 'akan', 'juga',
            'saya', 'kamu', 'dia', 'mereka', 'kami', 'kita', 'bisa',
            'tidak', 'bukan', 'belum', 'sudah', 'telah', 'atau',
            'tapi', 'namun', 'jika', 'maka', 'serta', 'oleh', 'lebih'
        }
        tokens = text.split()
        tokens = [t for t in tokens if t not in stop_words]
        text = ' '.join(tokens)
    return text
# Example usage
texts_raw = [
    "Produk ini BAGUS banget!!! 👍😊 http://shopee.com",
    "Saya sangat kecewa dengan pelayanan yang buruk :(",
    "Harga Rp 50.000 sangat murah untuk kualitas ini! 👍",
    "Barang datang dalam 3 hari. Packaging rapi. Recommended!!!",
    "Saya sudah kirim email ke support@toko.com tapi tidak dibalas"
]
print("=" * 70)
print("TEXT PREPROCESSING RESULTS")
print("=" * 70)
for text in texts_raw:
    clean = preprocess_text(text, remove_stopwords=True)
    print(f"\nOriginal: {text[:60]}...")
    print(f"Cleaned: {clean}")
3. Tokenization
Tokenization is the process of splitting text into smaller units (tokens). It can operate at several levels, each with its own methods and trade-offs.
Types of Tokenization
1. Word-level tokenization (traditional): "Saya suka Machine Learning" → ["Saya", "suka", "Machine", "Learning"]
- Pro: easy to understand
- Con: large vocabulary (100k+ words)
- Con: out-of-vocabulary (OOV) problem: new words are unknown
- Con: cannot handle typos: "machne" → ???
2. Character-level tokenization: "Hello" → ['H', 'e', 'l', 'l', 'o']
- Pro: very small vocabulary (~200 characters)
- Pro: no OOV problem
- Con: very long sequences
- Con: hard to capture word meaning
3. Subword tokenization (modern): "unhappiness" → ["un", "##happi", "##ness"]
- Pro: moderate vocabulary (30k-50k)
- Pro: very few OOV tokens
- Pro: sequences are not too long
- Pro: captures word morphology
Methods:
- BPE (Byte Pair Encoding): GPT, LLaMA
- WordPiece: BERT
- SentencePiece: T5, mBART
- Unigram: XLNet
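To make the idea behind BPE concrete, here is a minimal sketch of how merge rules can be learned: start from characters and repeatedly merge the most frequent adjacent pair. The function names (`merge_pair`, `learn_bpe`, `bpe_tokenize`) and the tiny word list are hypothetical, chosen for illustration; production tokenizers add byte-level handling, word-boundary markers, and far larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def bpe_tokenize(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Made-up mini corpus of Indonesian words sharing the stems "makan" and "minum".
corpus_words = ["makan", "makanan", "makanlah", "minum", "minuman"]
merges = learn_bpe(corpus_words, 4)
print("Merges:", merges)
print("makanan ->", bpe_tokenize("makanan", merges))
```

Because frequent character sequences are merged first, common stems and affixes tend to become single tokens, while rare words still decompose into known pieces instead of becoming OOV.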
from transformers import AutoTokenizer
# =============================================
# 1. WORD TOKENIZATION (simple)
# =============================================
def simple_tokenize(text):
    """Simple tokenization: lowercase + whitespace split."""
    return text.lower().split()
text = "Natural Language Processing sangat menarik untuk dipelajari!"
tokens = simple_tokenize(text)
print(f"Word tokens: {tokens}")
print(f"Token count: {len(tokens)}")
# =============================================
# 2. SUBWORD TOKENIZATION dengan HuggingFace
# =============================================
# BERT WordPiece tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# GPT BPE tokenizer
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
# Multilingual tokenizer (support bahasa Indonesia!)
mbert_tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
text_en = "Natural Language Processing is fascinating!"
text_id = "Pemrosesan Bahasa Alami sangat menarik!"
print("\n" + "=" * 60)
print("SUBWORD TOKENIZATION COMPARISON")
print("=" * 60)
# English text
print(f"\nEnglish: '{text_en}'")
print(f"BERT (WordPiece): {bert_tokenizer.tokenize(text_en)}")
print(f"GPT-2 (BPE): {gpt_tokenizer.tokenize(text_en)}")
# Indonesian text
print(f"\nIndonesian: '{text_id}'")
print(f"mBERT (WordPiece): {mbert_tokenizer.tokenize(text_id)}")
# Show token IDs
tokens_en = bert_tokenizer(text_en, return_tensors='pt')
print(f"\nBERT Token IDs: {tokens_en['input_ids'].tolist()}")
print(f"Decoded: {bert_tokenizer.decode(tokens_en['input_ids'][0])}")
# Special tokens
print(f"\n=== BERT Special Tokens ===")
print(f"[CLS]: {bert_tokenizer.cls_token_id}")
print(f"[SEP]: {bert_tokenizer.sep_token_id}")
print(f"[PAD]: {bert_tokenizer.pad_token_id}")
print(f"[UNK]: {bert_tokenizer.unk_token_id}")
print(f"[MASK]: {bert_tokenizer.mask_token_id}")
# Vocabulary size
print(f"\n=== Vocabulary Size ===")
print(f"BERT (uncased): {bert_tokenizer.vocab_size:,}")
print(f"GPT-2: {gpt_tokenizer.vocab_size:,}")
print(f"mBERT (multilingual): {mbert_tokenizer.vocab_size:,}")
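The `##` prefixes printed by the BERT tokenizer come from WordPiece's greedy longest-match-first lookup. The sketch below illustrates that lookup with a hypothetical seven-entry vocabulary (`toy_vocab`); it is a simplified illustration, not the library's implementation, and a real BERT vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word (simplified)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word maps to [UNK].
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"un", "##happi", "##ness", "happy", "##s", "play", "##ing"}

for w in ["unhappiness", "playing", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this vocabulary, "unhappiness" splits into `un`, `##happi`, `##ness`, matching the subword example given earlier in this section.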
4. Text Representation
Once text has been tokenized, the tokens must be converted into numeric vectors that a model can process. Methods range from simple counts to learned representations.
Bag-of-Words (BoW)
Bag-of-Words represents a text by how often each word occurs in it. Word order is ignored; only the counts matter.
TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency) improves on BoW by weighting each word by its importance across the whole corpus: a word that appears in many documents gets a lower weight than a word concentrated in a few.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Contoh corpus
corpus = [
"Saya suka makan nasi goreng",
"Nasi goreng enak sekali",
"Saya suka minum kopi",
"Kopi dan nasi goreng adalah makanan favorit saya",
"Makan nasi goreng sambil minum kopi sangat nikmat"
]
# =============================================
# 1. BAG OF WORDS (BoW)
# =============================================
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print("=" * 60)
print("BAG OF WORDS (BoW)")
print("=" * 60)
print(f"Vocabulary: {bow.get_feature_names_out()}")
print(f"Matrix shape: {bow_matrix.shape}")
print(f"\nBoW Matrix (first 3 docs):")
for i in range(3):
    doc_vector = bow_matrix.toarray()[i]
    nonzero = {k: v for k, v in zip(bow.get_feature_names_out(), doc_vector) if v > 0}
    print(f"  Doc {i+1}: {nonzero}")
# =============================================
# 2. TF-IDF
# =============================================
tfidf = TfidfVectorizer(max_features=20, ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(corpus)
print("\n" + "=" * 60)
print("TF-IDF")
print("=" * 60)
print(f"Features: {tfidf.get_feature_names_out()}")
print(f"Matrix shape: {tfidf_matrix.shape}")
# Show top TF-IDF terms per document
print(f"\nTop terms per document:")
for i, doc in enumerate(corpus):
    feature_names = tfidf.get_feature_names_out()
    tfidf_scores = tfidf_matrix.toarray()[i]
    top_indices = np.argsort(tfidf_scores)[::-1][:3]
    top_terms = [(feature_names[j], tfidf_scores[j]) for j in top_indices if tfidf_scores[j] > 0]
    print(f"  Doc {i+1}: {top_terms}")
# =============================================
# 3. N-GRAMS
# =============================================
print("\n" + "=" * 60)
print("N-GRAMS (Unigram, Bigram, Trigram)")
print("=" * 60)
text = "saya sangat suka belajar machine learning"
words = text.split()
# Unigrams
unigrams = words
print(f"Unigrams: {unigrams}")
# Bigrams
bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
print(f"Bigrams: {bigrams}")
# Trigrams
trigrams = [f"{words[i]}_{words[i+1]}_{words[i+2]}" for i in range(len(words)-2)]
print(f"Trigrams: {trigrams}")
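To show what the TF-IDF weighting described above computes, here is a dependency-free sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is the default behaviour of scikit-learn's `TfidfVectorizer`; the whitespace tokenizer and the three documents are simplifications made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (per-document TF-IDF dicts, idf dict) using smoothed IDF and L2 norm."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors, idf

# Made-up documents for illustration.
docs = ["nasi goreng enak", "nasi goreng pedas", "kopi enak"]
vectors, idf = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    ranked = sorted(vec.items(), key=lambda kv: -kv[1])
    print(doc, "->", [(t, round(w, 3)) for t, w in ranked])
```

In the third document, "kopi" outranks "enak" because "enak" also appears in another document, so its IDF is lower.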
5. Word Embeddings (Word2Vec, GloVe)
Word embeddings represent words as dense vectors that capture semantic meaning. Unlike BoW/TF-IDF, which are sparse and carry no notion of meaning, embeddings place words with similar meanings close together in vector space.
Word2Vec: "King - Man + Woman ≈ Queen"
Word embedding space (2D projection via t-SNE, schematic): animal words (dog, cat, puppy, hamster) cluster together, food words (cake, bread) form a separate cluster, and gendered pairs (king/queen, man/woman, boy/girl) line up along a shared gender direction, which is what makes "king - man + woman ≈ queen" work.
Properties:
- Similar words have similar vectors (small cosine distance)
- Arithmetic semantics: vector("king") - vector("man") + vector("woman") ≈ vector("queen")
- 50-300 dimensions (vs BoW: 10k+ sparse dimensions)
Popular methods:
| Model | Dimensions | Training method |
|---|---|---|
| Word2Vec | 100-300 | Skip-gram / CBOW (Google 2013) |
| GloVe | 50-300 | Co-occurrence matrix (Stanford) |
| FastText | 100-300 | Subword + Word2Vec (FB 2016) |
| ELMo | 1024 | Bidirectional LSTM (2018) |
| BERT | 768-1024 | Transformer (contextual, 2018) |
Note: Word2Vec/GloVe are NON-contextual (each word has one fixed vector), while BERT/ELMo are CONTEXTUAL (the vector changes with the sentence). For example, "bank" in "river bank" and in "bank account" gets different vectors.
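The two properties above, cosine similarity and vector arithmetic, can be illustrated without any trained model. The 2-D vectors below are hand-made for this example (one axis loosely standing for "royalty", the other for "gender"); real embeddings are learned and have 50-300 dimensions, so treat this purely as a sketch of the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (NOT learned): [royalty, gender]
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "bread": [-0.8, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print("sim(king, man)   =", round(cosine(toy_vectors["king"], toy_vectors["man"]), 3))
print("sim(king, bread) =", round(cosine(toy_vectors["king"], toy_vectors["bread"]), 3))
print("king - man + woman ->", analogy("king", "man", "woman", toy_vectors))
```

Excluding the three input words from the candidates mirrors common practice in analogy evaluation, since the nearest vector to the target is otherwise often one of the inputs.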
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
# =============================================
# 1. TRAIN WORD2VEC dari data lokal
# =============================================
# Contoh korpus (bisa dari dataset besar)
corpus = [
['saya', 'suka', 'makan', 'nasi', 'goreng'],
['nasi', 'goreng', 'enak', 'sekali'],
['saya', 'suka', 'minum', 'kopi', 'panas'],
['kopi', 'dan', 'nasi', 'goreng', 'makanan', 'favorit'],
['makan', 'nasi', 'goreng', 'sambil', 'minum', 'kopi'],
['saya', 'suka', 'makan', 'bakso', 'dan', 'soto'],
['bakso', 'adalah', 'makanan', 'yang', 'enak'],
['soto', 'juga', 'makanan', 'favorit', 'saya'],
]
# Train Word2Vec (Skip-gram)
model_w2v = Word2Vec(
sentences=corpus,
vector_size=50, # Embedding dimensions
window=3, # Context window size
min_count=1, # Minimum word frequency
sg=1, # 1=Skip-gram, 0=CBOW
epochs=100,
seed=42
)
print("=" * 60)
print("WORD2VEC RESULTS")
print("=" * 60)
# Get embedding for a word
word = 'nasi'
print(f"\nEmbedding for '{word}': {model_w2v.wv[word][:10]}...") # first 10 dims
print(f"Embedding shape: {model_w2v.wv[word].shape}")
# Most similar words
print(f"\nMost similar to 'nasi':")
for neighbor, score in model_w2v.wv.most_similar('nasi', topn=5):
    print(f"  {neighbor}: {score:.4f}")
# Word similarity
print(f"\nSimilarity between pairs:")
pairs = [
('nasi', 'goreng'),
('nasi', 'bakso'),
('saya', 'goreng'),
('kopi', 'makanan'),
]
for w1, w2 in pairs:
    sim = model_w2v.wv.similarity(w1, w2)
    print(f"  sim('{w1}', '{w2}') = {sim:.4f}")
# Analogy: king - man + woman = queen (with real embeddings)
# With a small corpus, we only demonstrate the concept
print(f"\nWord analogy (simulated):")
print(f"  nasi → makanan (staple food)")
print(f"  kopi → minuman (drink)")
# =============================================
# 2. PRE-TRAINED EMBEDDINGS (GloVe)
# =============================================
# In practice, use pre-trained embeddings:
# Download from: https://nlp.stanford.edu/projects/glove/
# File: glove.6B.100d.txt (100-dimensional, 6B tokens)
print("\n" + "=" * 60)
print("PRE-TRAINED EMBEDDINGS INFO")
print("=" * 60)
print("GloVe (Stanford NLP):")
print(" - glove.6B.50d.txt (50 dim, 400K vocab)")
print(" - glove.6B.100d.txt (100 dim, 400K vocab)")
print(" - glove.6B.200d.txt (200 dim, 400K vocab)")
print(" - glove.6B.300d.txt (300 dim, 400K vocab)")
print("\nFastText (Facebook):")
print(" - 157 languages, 300 dim")
print(" - Bisa handle kata OOV via subword!")
print(" - Download: https://fasttext.cc/docs/en/crawl-vectors.html")
print("\nUntuk bahasa Indonesia:")
print(" - fastText.id.300d.txt")
print(" - word2vec.id.300d.txt (from various sources)")
6. Traditional Classification Models
Before deep learning came to dominate, several traditional machine learning models were already very effective for text classification, especially on TF-IDF features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
# =============================================
# DATASET: Simulated Sentiment Analysis
# =============================================
# (Di real-world: gunakan dataset dari Kaggle, dsb.)
texts = [
"Produk ini sangat bagus dan berkualitas tinggi",
"Saya sangat puas dengan pembelian ini",
"Barang cepat sampai dan sesuai deskripsi",
"Kualitas sangat baik, recommended banget",
"Pelayanan ramah dan pengiriman cepat",
"Sangat senang dengan hasilnya, bagus sekali",
"Makanan enak dan pelayanan memuaskan",
"Gadget ini canggih dan berfungsi sempurna",
"Harga sebanding dengan kualitas",
"Akan beli lagi di toko ini, top!",
"Produk sangat mengecewakan dan rusak",
"Barang tidak sesuai foto, sangat buruk",
"Pengiriman sangat lambat dan tidak aman",
"Kualitas jelek, buang uang saja",
"Pelayanan buruk, tidak ramah",
"Barang palsu dan tidak berfungsi",
"Sangat kecewa dengan produk ini",
"Makanan tidak enak dan pelayanan lambat",
"Harga mahal tapi kualitas rendah",
"Tidak akan pernah beli di sini lagi"
] * 25 # Repeat for more data. Caveat: exact duplicates land in both train and test, so the scores below are optimistic
labels = ([1]*10 + [0]*10) * 25 # 1=positive, 0=negative
X_train, X_test, y_train, y_test = train_test_split(
texts, labels, test_size=0.2, random_state=42, stratify=labels
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# =============================================
# COMPARING CLASSIFIERS
# =============================================
classifiers = {
'Naive Bayes': MultinomialNB(alpha=1.0),
'Logistic Regression': LogisticRegression(max_iter=1000, C=1.0),
'Linear SVM': LinearSVC(C=1.0, max_iter=5000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
print("\n" + "=" * 60)
print("COMPARISON: TF-IDF + CLASSIFIERS")
print("=" * 60)
results = {}
for name, clf in classifiers.items():
    # Pipeline: TF-IDF → Classifier
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('clf', clf)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='accuracy')
    results[name] = {
        'accuracy': acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    print(f"\n{name}:")
    print(f"  Test Accuracy: {acc:.4f}")
    print(f"  CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Best model
best_name = max(results, key=lambda x: results[x]['cv_mean'])
print(f"\n{'='*60}")
print(f"BEST MODEL: {best_name} (CV: {results[best_name]['cv_mean']:.4f})")
print(f"{'='*60}")
# =============================================
# PREDICTION EXAMPLE
# =============================================
best_pipeline = Pipeline([
('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
('clf', classifiers[best_name])
])
best_pipeline.fit(X_train, y_train)
new_texts = [
"Produk ini luar biasa bagus, saya sangat puas!",
"Barang rusak dan pelayanan sangat buruk",
"Biasa saja, tidak terlalu bagus tapi tidak jelek juga",
"Pengiriman cepat dan packaging rapi, terima kasih!"
]
predictions = best_pipeline.predict(new_texts)
labels_map = {0: 'Negatif', 1: 'Positif'}
print(f"\n=== Predictions (Best: {best_name}) ===")
for text, pred in zip(new_texts, predictions):
    print(f"  [{labels_map[pred]:>8}] {text}")
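To show what the Naive Bayes entry in the comparison above does internally, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. The class name `TinyMultinomialNB` and the four training sentences are hypothetical; it works on raw word counts rather than TF-IDF features and is meant to expose the arithmetic, not to replace scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Multinomial Naive Bayes on whitespace tokens (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Unnormalized log P(class) + sum of log P(word | class)."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / n_docs)  # class prior
            for token in text.lower().split():
                if token not in self.vocab:
                    continue  # words never seen in training are skipped
                count = self.word_counts[c][token]
                score += math.log((count + self.alpha) / (total_words + self.alpha * vocab_size))
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

# Made-up training data: 1 = positive, 0 = negative.
train_texts = ["produk bagus sekali", "pelayanan ramah dan cepat",
               "barang rusak parah", "pelayanan buruk dan lambat"]
train_labels = [1, 1, 0, 0]
nb = TinyMultinomialNB(alpha=1.0).fit(train_texts, train_labels)
print(nb.predict("produk bagus"), nb.predict("barang rusak"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the `alpha` term keeps a word unseen in one class from driving that class's probability to zero.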
7. Deep Learning for NLP
Evolution of NLP Models
- 2003: Neural LM (Bengio)
  - First word embeddings
- 2013: Word2Vec (Mikolov)
  - Pre-trained word vectors → transfer learning
  - "king - man + woman ≈ queen"
- 2014: seq2seq + Attention (Sutskever, Bahdanau)
  - Encoder-decoder LSTM
  - Neural machine translation
- 2018: ELMo (Peters)
  - Contextual word embeddings
  - Same word → different vectors per context
- 2018: BERT (Google)
  - Bidirectional Transformer encoder
  - Pre-train → fine-tune paradigm
  - State of the art on 11 NLP tasks at once
- 2019: GPT-2 (OpenAI)
  - Autoregressive Transformer decoder
  - Generative language model
- 2020: GPT-3 (OpenAI)
  - 175B parameters
  - In-context learning (few-shot)
- 2022: ChatGPT / InstructGPT
  - RLHF alignment
  - Conversational AI
- 2023-2025: GPT-4, LLaMA, Gemini, Claude
  - Multi-modal
  - Long context (100k+ tokens)
  - Reasoning capabilities
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import numpy as np
# =============================================
# 1. VOCABULARY BUILDER
# =============================================
class Vocabulary:
    def __init__(self, max_size=10000):
        # Index 0 is reserved for padding, index 1 for unknown words
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.max_size = max_size
    def build(self, texts):
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())
        most_common = counter.most_common(self.max_size - 2)
        for idx, (word, count) in enumerate(most_common, start=2):
            self.word2idx[word] = idx
            self.idx2word[idx] = word
    def encode(self, text, max_len=50):
        tokens = text.lower().split()
        # Unknown words map to index 1 (<UNK>)
        ids = [self.word2idx.get(t, 1) for t in tokens[:max_len]]
        # Pad with index 0 (<PAD>) up to max_len
        ids += [0] * (max_len - len(ids))
        return ids
    def __len__(self):
        return len(self.word2idx)
# =============================================
# 2. DATASET
# =============================================
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len=50):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        encoded = self.vocab.encode(self.texts[idx], self.max_len)
        return torch.tensor(encoded, dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)
# =============================================
# 3. LSTM MODEL
# =============================================
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes,
                 num_layers=2, dropout=0.3, bidirectional=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=bidirectional,
            dropout=dropout
        )
        self.dropout = nn.Dropout(dropout)
        # Bidirectional → hidden_dim * 2
        direction_factor = 2 if bidirectional else 1
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * direction_factor, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # Concatenate the final hidden states of both directions (last layer)
        if self.lstm.bidirectional:
            hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        else:
            hidden_cat = hidden[-1]
        out = self.dropout(hidden_cat)
        out = self.fc(out)
        return out
# =============================================
# 4. TRAINING
# =============================================
# Sample data
texts = [
"Produk sangat bagus dan berkualitas", "Saya sangat puas dengan layanan",
"Barang rusak dan tidak sesuai", "Pelayanan sangat buruk dan lambat",
"Makanan enak dan porsi besar", "Kualitas sangat memuaskan",
"Barang palsu dan tidak original", "Sangat kecewa dengan pembelian",
"Pengiriman cepat dan aman", "Harga terjangkau kualitas bagus",
] * 30
labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1] * 30
# Build vocabulary
vocab = Vocabulary(max_size=1000)
vocab.build(texts)
# Create datasets
train_dataset = TextDataset(texts[:200], labels[:200], vocab)
test_dataset = TextDataset(texts[200:], labels[200:], vocab)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
# Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMClassifier(
vocab_size=len(vocab),
embed_dim=64,
hidden_dim=128,
num_classes=2,
num_layers=2,
bidirectional=True
).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Train loop (simplified)
print("Training LSTM Text Classifier...")
print(f"Vocab size: {len(vocab)}")
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")
for epoch in range(10):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    for texts_batch, labels_batch in train_loader:
        texts_batch = texts_batch.to(device)
        labels_batch = labels_batch.to(device)
        optimizer.zero_grad()
        outputs = model(texts_batch)
        loss = criterion(outputs, labels_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels_batch.size(0)
        correct += predicted.eq(labels_batch).sum().item()
    train_acc = 100. * correct / total
    # Test
    model.eval()
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for texts_batch, labels_batch in test_loader:
            texts_batch = texts_batch.to(device)
            labels_batch = labels_batch.to(device)
            outputs = model(texts_batch)
            _, predicted = outputs.max(1)
            test_total += labels_batch.size(0)
            test_correct += predicted.eq(labels_batch).sum().item()
    test_acc = 100. * test_correct / test_total
    print(f"Epoch [{epoch+1:2d}/10] Loss: {total_loss:.4f} "
          f"Train: {train_acc:.1f}% Test: {test_acc:.1f}%")
8. Fine-tuning BERT for Classification
Fine-tuning BERT is one of the strongest approaches to text classification. We take a BERT model already pre-trained on a large corpus (Wikipedia, Books) and continue training it (fine-tune) on our own task-specific dataset.
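import torch
from torch.optim import AdamW  # transformers' own AdamW is deprecated in favor of the PyTorch implementation
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    get_linear_schedule_with_warmup
)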
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# =============================================
# 1. CUSTOM DATASET
# =============================================
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
# =============================================
# 2. PREPARE DATA
# =============================================
texts = [
"Produk ini sangat bagus dan berkualitas tinggi",
"Saya sangat puas dengan pembelian ini",
"Barang cepat sampai dan sesuai deskripsi",
"Kualitas sangat baik, recommended",
"Pelayanan ramah dan pengiriman cepat",
"Sangat senang dengan hasilnya",
"Produk sangat mengecewakan dan rusak",
"Barang tidak sesuai foto, sangat buruk",
"Pengiriman sangat lambat",
"Kualitas jelek, buang uang saja",
"Pelayanan buruk dan tidak ramah",
"Barang palsu dan tidak berfungsi",
] * 30 # Repeat for enough data
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] * 30
X_train, X_val, y_train, y_val = train_test_split(
texts, labels, test_size=0.2, random_state=42, stratify=labels
)
# =============================================
# 3. LOAD BERT
# =============================================
MODEL_NAME = 'indobenchmark/indobert-base-p1' # BERT bahasa Indonesia!
# Alternatif: 'bert-base-multilingual-cased' (multilingual)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
MODEL_NAME, num_labels=2
)
print(f"Model: {MODEL_NAME}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Create DataLoaders
train_dataset = TextClassificationDataset(X_train, y_train, tokenizer, max_len=128)
val_dataset = TextClassificationDataset(X_val, y_val, tokenizer, max_len=128)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# =============================================
# 4. FINE-TUNING
# =============================================
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
# Optimizer with weight decay
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Learning rate scheduler
num_epochs = 5
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=int(0.1 * total_steps), # 10% warmup
num_training_steps=total_steps
)
# Training loop
print("\nFine-tuning BERT...")
for epoch in range(num_epochs):
    # TRAIN
    model.train()
    total_loss = 0
    train_preds, train_true = [], []
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels_batch = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels_batch)
        loss = outputs.loss
        logits = outputs.logits
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        train_preds.extend(preds)
        train_true.extend(labels_batch.cpu().numpy())
    train_acc = accuracy_score(train_true, train_preds)
    avg_loss = total_loss / len(train_loader)
    # VALIDATION
    model.eval()
    val_preds, val_true = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_batch = batch['labels'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
            val_preds.extend(preds)
            val_true.extend(labels_batch.cpu().numpy())
    val_acc = accuracy_score(val_true, val_preds)
    print(f"Epoch [{epoch+1}/{num_epochs}] "
          f"Loss: {avg_loss:.4f} | "
          f"Train Acc: {train_acc:.4f} | "
          f"Val Acc: {val_acc:.4f}")
# =============================================
# 5. INFERENCE
# =============================================
model.eval()
new_texts = [
"Produk luar biasa bagus, sangat puas!",
"Barang rusak dan pelayanan sangat buruk",
"Cukup bagus untuk harga segitu"
]
print("\n=== Predictions ===")
for text in new_texts:
    encoding = tokenizer(text, return_tensors='pt', max_length=128,
                         padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    probs = torch.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    label = "Positif" if pred == 1 else "Negatif"
    conf = probs[0][pred].item()
    print(f"  [{label} ({conf:.2%})] {text}")
- Learning rate matters a lot: use 2e-5 to 5e-5 (do not go too high)
- Small batch size: 16 or 32. If GPU memory is short, use gradient accumulation
- Epochs: 3-5 epochs are usually enough (more tends to overfit)
- Warmup: use 10% of the total steps for warmup
- Weight decay: 0.01 for regularization
- Use an Indonesian BERT: `indobenchmark/indobert-base-p1` for Indonesian text
- Max length: 128 or 256 (match the typical length of your texts)
- Mixed precision: `torch.cuda.amp` can speed up training by roughly 2-3×
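The warmup tip is easier to reason about with the schedule written out. The sketch below reproduces the shape of a linear warmup followed by linear decay, the kind of schedule configured by `get_linear_schedule_with_warmup` in the fine-tuning code; the function name `linear_warmup_lr` and the step counts are made up for illustration.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run: 100 optimizer steps with 10% warmup.
total_steps, warmup_steps = 100, 10
for step in (0, 5, 10, 55, 100):
    print(f"step {step:3d}: lr = {linear_warmup_lr(step, total_steps, warmup_steps):.2e}")
```

Starting from a near-zero learning rate keeps the first updates small while the freshly initialized classification head is still producing large, noisy gradients.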
9. Evaluation & Best Practices
Evaluation Metrics
| Metric | Formula | When to use |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced datasets |
| Precision | TP / (TP + FP) | When FP is costly (spam: do not block legitimate email) |
| Recall | TP / (TP + FN) | When FN is costly (disease detection: do not miss positive cases) |
| F1-Score | 2 × (P × R) / (P + R) | Imbalanced datasets; balances P and R |
| AUC-ROC | Area under the ROC curve | Threshold-independent evaluation |
| Confusion Matrix | TP, TN, FP, FN | Always; shows which kinds of errors occur |
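The formulas in the table can be checked by hand on a small example. The sketch below counts TP, FP, FN, and TN for made-up binary labels and derives each metric from those counts; the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion counts and the derived metrics for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = positive, 0 = negative.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)  # tp=3, fp=2, fn=2, tn=3
```

Here every metric comes out near 0.6 because the errors are symmetric (2 FP and 2 FN); on imbalanced data the four numbers usually diverge, which is why accuracy alone can mislead.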
Error Analysis
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
# Simulated predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]
texts_test = [
    "Produk sangat bagus",       # True: Pos, Pred: Pos ✓
    "Saya puas sekali",          # True: Pos, Pred: Pos ✓
    "Cukup lumayan lah",         # True: Pos, Pred: Neg ✗ (edge case!)
    "Barang rusak parah",        # True: Neg, Pred: Neg ✓
    "Tidak sesuai ekspektasi",   # True: Neg, Pred: Neg ✓
    "Biasa saja sih",            # True: Neg, Pred: Pos ✗ (edge case!)
    "Luar biasa bagusnya",       # True: Pos, Pred: Pos ✓
    "Kurang memuaskan",          # True: Pos, Pred: Neg ✗
    "Pelayanan sangat buruk",    # True: Neg, Pred: Neg ✓
    "Lumayan bagus kok",         # True: Neg, Pred: Pos ✗ (edge case!)
]
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(f" Pred Neg Pred Pos")
print(f" True Neg: {cm[0][0]:>6} {cm[0][1]:>6}")
print(f" True Pos: {cm[1][0]:>6} {cm[1][1]:>6}")
print(f"\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Negatif', 'Positif']))
# Error Analysis: Kumpulkan salah prediksi
print("\n=== ERROR ANALYSIS ===")
errors = []
for true, pred, text in zip(y_true, y_pred, texts_test):
    if true != pred:
        errors.append((text, true, pred))
print(f"Total errors: {len(errors)}/{len(y_true)}")
print(f"Error types:")
for text, true, pred in errors:
    label_true = "Pos" if true == 1 else "Neg"
    label_pred = "Pos" if pred == 1 else "Neg"
    print(f"  ✗ '{text}'")
    print(f"    True: {label_true} | Predicted: {label_pred}")
# Analisis pola error
print(f"\n=== ERROR PATTERN ANALYSIS ===")
print(f"Common patterns in errors:")
print(f" - Frasa ambigu/lunak: 'cukup lumayan', 'biasa saja'")
print(f" - Negasi kompleks: 'kurang memuaskan'")
print(f" - Sarcastic undertone: 'lumayan bagus kok'")
print(f"\nAction items:")
print(f" 1. Tambah data training untuk edge cases")
print(f" 2. Data augmentation: paraphrase detection")
print(f" 3. Coba model lebih besar (BERT vs LR)")
print(f" 4. Ensemble beberapa model")
Best Practices for Text Classification
- Start simple: use TF-IDF + Logistic Regression as a baseline. It is often already strong
- Data quality > model complexity: clean data and accurate labels matter more than a sophisticated model
- Handle class imbalance: oversampling (SMOTE), undersampling, or class_weight
- Data augmentation: paraphrasing, back-translation, synonym replacement
- Cross-validation: do not evaluate on a single split only
- Error analysis: always look at where the model is wrong. Error patterns show what to improve
- BERT for Indonesian text: use `indobenchmark/indobert-base-p1` or `bert-base-multilingual-cased`
- Ensemble: combine several models (voting/stacking) for better performance
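For the class-imbalance item, it can help to see what a "balanced" class weight amounts to. The sketch below assumes the heuristic n_samples / (n_classes × count_of_class), which is the formula scikit-learn documents for `class_weight='balanced'`; the helper name and the 90/10 label split are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical imbalanced dataset: 90 negatives, 10 positives.
labels_demo = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels_demo)
print(weights)
```

With these weights, a mistake on the rare class costs roughly nine times as much as a mistake on the common class, so the classifier can no longer score well by always predicting the majority label.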
10. Quiz: Test Your Understanding!
After reading the tutorial above, answer the following 5 questions to test your understanding of NLP text classification: