NLP Text Classification

A complete NLP text classification tutorial: tokenization, word embeddings (Word2Vec, GloVe), classification models (Naive Bayes, LSTM, BERT), fine-tuning, and practical implementation

1. Introduction to NLP Text Classification

Text classification is one of the most fundamental tasks in Natural Language Processing (NLP): assigning a label or category to a text based on its content. It shows up everywhere, from email spam filters and sentiment analysis of product reviews to content moderation on social media.

Applications of Text Classification

Application | Type | Example
Sentiment Analysis | Binary/Multi-class | Reviews: positive/negative/neutral
Spam Detection | Binary | Email: spam/not spam
Topic Classification | Multi-class | News: sports/politics/technology
Intent Detection | Multi-class | Chatbot: buy/check_status/complain
Language Detection | Multi-class | ID/EN/MS/ZH
Emotion Detection | Multi-class | Happy/sad/angry/afraid
Hate Speech Detection | Binary | Content: offensive/safe
Spam Review Detection | Binary | Reviews: fake/genuine

Text Classification Pipeline

Diagram: Text Classification Pipeline
Raw Text → Text Preprocessing → Tokenization → Vectorization / Embedding → Model → Output Label
Example: "Produk ini bagus!" → (Positive)

  Traditional pipeline:
  1. Lowercase, remove noise
  2. Word tokenization
  3. Bag-of-Words / TF-IDF
  4. Naive Bayes / SVM

  Modern pipeline:
  1. Subword tokenization (BPE/WordPiece)
  2. Pre-trained embeddings (BERT)
  3. Fine-tune a Transformer model

2. Text Preprocessing

Text preprocessing cleans and normalizes text before it is fed to a model. For traditional models in particular, the quality of preprocessing strongly affects the final results.

Preprocessing Steps

Step | Explanation | Example
Lowercasing | Convert all letters to lowercase | "Bagus!" → "bagus!"
Remove HTML tags | Strip HTML markup | "<p>teks</p>" → "teks"
Remove URLs | Drop links | "kunjungi http://x.com" → "kunjungi"
Remove special characters | Drop symbols and selected digits | "good!!! #nice" → "good nice"
Remove stopwords | Drop common words (di, ke, yang, dan) | "saya suka makan nasi" → "suka makan nasi"
Stemming | Strip affixes to get the root word | "bermain" → "main", "kebersihan" → "bersih"
Lemmatization | Map to the dictionary form (more accurate than stemming) | "running" → "run", "better" → "good"
Remove emoji | Drop emoji from the text | "bagus 👍👍" → "bagus"

⚠️ Note for Modern Models (BERT, GPT)

Modern Transformer models such as BERT and GPT do NOT need aggressive preprocessing. Stopwords, punctuation, and (for cased models) capitalization all carry useful information. For BERT/GPT, subword tokenization (WordPiece/BPE) is enough; skip stemming and stopword removal.
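To make the note above concrete, here is a minimal sketch of what "light" cleaning for a Transformer model might look like. The function name `light_clean` and the exact set of steps are assumptions for illustration, not a prescribed recipe: it only removes markup and normalizes whitespace and Unicode, leaving case, punctuation, stopwords, and emoji for the subword tokenizer.

```python
import html
import re
import unicodedata

def light_clean(text: str) -> str:
    """Minimal cleanup that keeps case, punctuation, stopwords, and emoji."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags only
    text = unicodedata.normalize("NFC", text)  # use one consistent Unicode form
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

sample = "<p>Produk ini  TIDAK bagus!!!</p>"
print(light_clean(sample))
```

The negation "TIDAK" and the exclamation marks survive, which is exactly the signal an aggressive pipeline would throw away.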

Python: Text Preprocessing
import re

# =============================================
# TEXT PREPROCESSING FUNCTIONS
# =============================================

def preprocess_text(text, remove_stopwords=False, do_stemming=False):
    """Preprocessing pipeline for Indonesian text.

    Note: do_stemming is currently unused; no stemmer is applied in this example.
    """

    # 1. Lowercase
    text = text.lower()

    # 2. Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # 3. Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # 4. Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)

    # 5. Remove emoji and special characters (keeps letters, digits, _, spaces)
    text = re.sub(r'[^\w\s]', ' ', text)

    # 6. Remove digits
    text = re.sub(r'\d+', '', text)

    # 7. Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # 8. Remove stopwords (optional)
    if remove_stopwords:
        # Small sample of Indonesian stopwords. It includes negations
        # ('tidak', 'bukan', 'belum'); dropping them can flip the sentiment.
        stop_words = {
            'yang', 'di', 'dan', 'ini', 'itu', 'dengan', 'untuk',
            'pada', 'ke', 'dari', 'ada', 'adalah', 'akan', 'juga',
            'saya', 'kamu', 'dia', 'mereka', 'kami', 'kita', 'bisa',
            'tidak', 'bukan', 'belum', 'sudah', 'telah', 'atau',
            'tapi', 'namun', 'jika', 'maka', 'serta', 'oleh', 'lebih'
        }
        tokens = [t for t in text.split() if t not in stop_words]
        text = ' '.join(tokens)

    return text

# Example usage
texts_raw = [
    "Produk ini BAGUS banget!!! 😍😍 http://shopee.com",
    "<p>Saya sangat kecewa dengan pelayanan yang buruk :(</p>",
    "Harga Rp 50.000 sangat murah untuk kualitas ini! 👍",
    "Barang datang dalam 3 hari. Packaging rapi. Recommended!!!",
    "Saya sudah kirim email ke support@toko.com tapi tidak dibalas"
]

print("=" * 70)
print("TEXT PREPROCESSING RESULTS")
print("=" * 70)

for text in texts_raw:
    clean = preprocess_text(text, remove_stopwords=True)
    print(f"\nOriginal: {text[:60]}...")
    print(f"Cleaned:  {clean}")

3. Tokenization

Tokenization is the process of splitting text into smaller units (tokens). It can be done at several levels, each with its own methods.

Types of Tokenization

Diagram: Types of Tokenization
1. WORD-LEVEL TOKENIZATION (traditional):
   "Saya suka Machine Learning" → ["Saya", "suka", "Machine", "Learning"]

   ✅ Easy to understand
   ❌ Large vocabulary (100k+ words)
   ❌ Out-of-vocabulary (OOV) problem: new words are unknown
   ❌ Cannot handle typos: "machne" → ???

2. CHARACTER-LEVEL TOKENIZATION:
   "Hello" → ['H', 'e', 'l', 'l', 'o']

   ✅ Very small vocabulary (~200 characters)
   ✅ No OOV problem
   ❌ Very long sequences
   ❌ Harder to capture word meaning

3. SUBWORD TOKENIZATION (modern ⭐):
   "unhappiness" → ["un", "##happi", "##ness"]   (illustrative split)

   ✅ Moderate vocabulary (30k-50k)
   ✅ Very few OOV tokens
   ✅ Sequences stay reasonably short
   ✅ Captures word morphology

   Methods:
   • BPE (Byte Pair Encoding): GPT, LLaMA
   • WordPiece: BERT
   • SentencePiece (a toolkit implementing BPE and Unigram): T5, mBART
   • Unigram LM: XLNet (via SentencePiece)
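To show how a subword vocabulary comes about, the following is a minimal sketch of the core BPE training loop: start from characters and repeatedly merge the most frequent adjacent pair. The helper names `learn_bpe` and `apply_bpe` and the toy word frequencies are made up for illustration; production tokenizers add byte-level handling, end-of-word markers, and many optimizations that are omitted here.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merges from a {word: frequency} mapping."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy frequencies (hypothetical): related Indonesian words share subwords.
word_freqs = {"makan": 5, "makanan": 3, "minum": 4, "minuman": 2}
merges = learn_bpe(word_freqs, num_merges=4)
print(merges)
print(apply_bpe("makanan", merges))
print(apply_bpe("minuman", merges))
```

Because unseen words fall back to smaller pieces (ultimately single characters), this scheme has almost no OOV problem, which is the property listed in the diagram above.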
Python: Tokenization Methods
from transformers import AutoTokenizer

# =============================================
# 1. WORD TOKENIZATION (simple)
# =============================================
def simple_tokenize(text):
    """Simple tokenization: lowercase + whitespace split."""
    return text.lower().split()

text = "Natural Language Processing sangat menarik untuk dipelajari!"
tokens = simple_tokenize(text)
print(f"Word tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# =============================================
# 2. SUBWORD TOKENIZATION with Hugging Face
# =============================================
# BERT WordPiece tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2 byte-level BPE tokenizer
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Multilingual tokenizer (covers Indonesian)
mbert_tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

text_en = "Natural Language Processing is fascinating!"
text_id = "Pemrosesan Bahasa Alami sangat menarik!"

print("\n" + "=" * 60)
print("SUBWORD TOKENIZATION COMPARISON")
print("=" * 60)

# English text
print(f"\nEnglish: '{text_en}'")
print(f"BERT (WordPiece): {bert_tokenizer.tokenize(text_en)}")
print(f"GPT-2 (BPE):      {gpt_tokenizer.tokenize(text_en)}")

# Indonesian text
print(f"\nIndonesian: '{text_id}'")
print(f"mBERT (WordPiece): {mbert_tokenizer.tokenize(text_id)}")

# Show token IDs
tokens_en = bert_tokenizer(text_en, return_tensors='pt')
print(f"\nBERT Token IDs: {tokens_en['input_ids'].tolist()}")
print(f"Decoded: {bert_tokenizer.decode(tokens_en['input_ids'][0])}")

# Special tokens
print(f"\n=== BERT Special Tokens ===")
print(f"[CLS]: {bert_tokenizer.cls_token_id}")
print(f"[SEP]: {bert_tokenizer.sep_token_id}")
print(f"[PAD]: {bert_tokenizer.pad_token_id}")
print(f"[UNK]: {bert_tokenizer.unk_token_id}")
print(f"[MASK]: {bert_tokenizer.mask_token_id}")

# Vocabulary size
print(f"\n=== Vocabulary Size ===")
print(f"BERT (uncased): {bert_tokenizer.vocab_size:,}")
print(f"GPT-2: {gpt_tokenizer.vocab_size:,}")
print(f"mBERT (multilingual): {mbert_tokenizer.vocab_size:,}")

4. Text Representation

Once text has been tokenized, the tokens must be converted into numeric vectors that a model can process. The methods range from very simple to quite sophisticated.

Bag-of-Words (BoW)

Bag-of-Words represents a text by how often each word occurs in it. Word order is ignored; only the counts matter.

TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) improves on BoW by weighting each word by how informative it is across the whole corpus: words that are frequent in one document but rare elsewhere score high, while words that appear in nearly every document score low.
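The weighting can be worked through by hand. Below is a small sketch of the textbook formula, tf(t, d) × log(N / df(t)), in plain Python; the function name `tf_idf` and the three toy documents are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["nasi goreng enak", "kopi enak", "soto enak sekali"]
scores = tf_idf(docs)
for doc, weight in zip(docs, scores):
    print(doc, {term: round(w, 3) for term, w in weight.items()})
```

Here "enak" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while "kopi" is unique to one document and gets the largest weight there.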

Python: BoW and TF-IDF
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example corpus
corpus = [
    "Saya suka makan nasi goreng",
    "Nasi goreng enak sekali",
    "Saya suka minum kopi",
    "Kopi dan nasi goreng adalah makanan favorit saya",
    "Makan nasi goreng sambil minum kopi sangat nikmat"
]

# =============================================
# 1. BAG OF WORDS (BoW)
# =============================================
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)

print("=" * 60)
print("BAG OF WORDS (BoW)")
print("=" * 60)
print(f"Vocabulary: {bow.get_feature_names_out()}")
print(f"Matrix shape: {bow_matrix.shape}")
print(f"\nBoW Matrix (first 3 docs):")
for i in range(3):
    doc_vector = bow_matrix.toarray()[i]
    nonzero = {k: v for k, v in zip(bow.get_feature_names_out(), doc_vector) if v > 0}
    print(f"  Doc {i+1}: {nonzero}")

# =============================================
# 2. TF-IDF
# =============================================
tfidf = TfidfVectorizer(max_features=20, ngram_range=(1, 2))
tfidf_matrix = tfidf.fit_transform(corpus)

print("\n" + "=" * 60)
print("TF-IDF")
print("=" * 60)
print(f"Features: {tfidf.get_feature_names_out()}")
print(f"Matrix shape: {tfidf_matrix.shape}")

# Show top TF-IDF terms per document
print(f"\nTop terms per document:")
for i, doc in enumerate(corpus):
    feature_names = tfidf.get_feature_names_out()
    tfidf_scores = tfidf_matrix.toarray()[i]
    top_indices = np.argsort(tfidf_scores)[::-1][:3]
    top_terms = [(feature_names[j], tfidf_scores[j]) for j in top_indices if tfidf_scores[j] > 0]
    print(f"  Doc {i+1}: {top_terms}")

# =============================================
# 3. N-GRAMS
# =============================================
print("\n" + "=" * 60)
print("N-GRAMS (Unigram, Bigram, Trigram)")
print("=" * 60)

text = "saya sangat suka belajar machine learning"
words = text.split()

# Unigrams
unigrams = words
print(f"Unigrams: {unigrams}")

# Bigrams
bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
print(f"Bigrams:  {bigrams}")

# Trigrams
trigrams = [f"{words[i]}_{words[i+1]}_{words[i+2]}" for i in range(len(words)-2)]
print(f"Trigrams: {trigrams}")

5. Word Embeddings (Word2Vec, GloVe)

Word embeddings represent words as dense vectors that capture semantic meaning. Unlike BoW/TF-IDF, which are sparse and do not capture meaning, embeddings place words with similar meanings close together in vector space.
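"Close together" is usually measured with cosine similarity, and the same geometry is what makes the well-known analogy arithmetic possible. The sketch below uses hand-made 3-dimensional toy vectors (not learned embeddings; the numbers and the `analogy` helper are invented for illustration) purely to show the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors with axes (royalty, male, female). NOT trained.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "bread": [0.0, 0.1, 0.1],
}

def analogy(a, b, c, vectors):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(round(cosine(toy["king"], toy["queen"]), 3))
print(round(cosine(toy["king"], toy["bread"]), 3))
print(analogy("king", "man", "woman", toy))
```

With real embeddings the result of the arithmetic is only approximately equal to the target word's vector, which is why the nearest neighbour is looked up rather than an exact match.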

Word2Vec: "King - Man + Woman ≈ Queen"

Diagram: Word Embeddings
WORD EMBEDDINGS SPACE (2D projection via t-SNE):

         animal
           ↑
    dog   cat
       \  /
        \/
  puppy /\hamster       ← Animal words cluster together!
       /  \
           ↓
         food
    cake  bread
       \  /
        \/
         ↔ gender
  king  queen           ← "king - man + woman ≈ queen"
  man   woman
  boy   girl

PROPERTIES:
• Similar words have similar vectors (small cosine distance)
• Arithmetic semantics: vector("king") - vector("man") + vector("woman")
                         ≈ vector("queen")
• 50-300 dimensions (vs BoW: 10k+ sparse dimensions)

POPULAR METHODS:
Model | Dimensions | Training method
Word2Vec | 100-300 | Skip-gram / CBOW (Google, 2013)
GloVe | 50-300 | Co-occurrence matrix (Stanford, 2014)
FastText | 100-300 | Subword + Word2Vec (Facebook, 2016)
ELMo | 1024 | Bidirectional LSTM (2018)
BERT | 768-1024 | Transformer (contextual, 2018)

Note: Word2Vec/GloVe = NON-contextual (each word has one fixed vector)
      BERT/ELMo = CONTEXTUAL (the vector changes with the sentence)
      "bank" in "river bank" vs "bank account" → different vectors
Python: Word Embeddings
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# =============================================
# 1. TRAIN WORD2VEC on local data
# =============================================
# Example corpus (in practice, use a large dataset; a corpus this small
# yields noisy vectors and is for demonstration only)
corpus = [
    ['saya', 'suka', 'makan', 'nasi', 'goreng'],
    ['nasi', 'goreng', 'enak', 'sekali'],
    ['saya', 'suka', 'minum', 'kopi', 'panas'],
    ['kopi', 'dan', 'nasi', 'goreng', 'makanan', 'favorit'],
    ['makan', 'nasi', 'goreng', 'sambil', 'minum', 'kopi'],
    ['saya', 'suka', 'makan', 'bakso', 'dan', 'soto'],
    ['bakso', 'adalah', 'makanan', 'yang', 'enak'],
    ['soto', 'juga', 'makanan', 'favorit', 'saya'],
]

# Train Word2Vec (Skip-gram)
model_w2v = Word2Vec(
    sentences=corpus,
    vector_size=50,     # Embedding dimensions
    window=3,           # Context window size
    min_count=1,        # Minimum word frequency
    sg=1,               # 1=Skip-gram, 0=CBOW
    epochs=100,
    seed=42
)

print("=" * 60)
print("WORD2VEC RESULTS")
print("=" * 60)

# Get embedding for a word
word = 'nasi'
print(f"\nEmbedding for '{word}': {model_w2v.wv[word][:10]}...")  # first 10 dims
print(f"Embedding shape: {model_w2v.wv[word].shape}")

# Most similar words (loop variable renamed so it does not overwrite `word`)
print(f"\nMost similar to 'nasi':")
for similar_word, score in model_w2v.wv.most_similar('nasi', topn=5):
    print(f"  {similar_word}: {score:.4f}")

# Word similarity
print(f"\nSimilarity between pairs:")
pairs = [
    ('nasi', 'goreng'),
    ('nasi', 'bakso'),
    ('saya', 'goreng'),
    ('kopi', 'makanan'),
]
for w1, w2 in pairs:
    sim = model_w2v.wv.similarity(w1, w2)
    print(f"  sim('{w1}', '{w2}') = {sim:.4f}")

# Analogy: king - man + woman ≈ queen needs embeddings trained on a large corpus.
# With this tiny corpus we only illustrate the idea.
print(f"\nWord analogy (simulated):")
print(f"  nasi ≈ makanan (staple food)")
print(f"  kopi ≈ minuman (drink)")

# =============================================
# 2. PRE-TRAINED EMBEDDINGS (GloVe)
# =============================================
# In practice, use pre-trained embeddings:
# Download from: https://nlp.stanford.edu/projects/glove/
# File: glove.6B.100d.txt (100-dimensional, 6B tokens)

print("\n" + "=" * 60)
print("PRE-TRAINED EMBEDDINGS INFO")
print("=" * 60)
print("GloVe (Stanford NLP):")
print("  - glove.6B.50d.txt  (50 dim, 400K vocab)")
print("  - glove.6B.100d.txt (100 dim, 400K vocab)")
print("  - glove.6B.200d.txt (200 dim, 400K vocab)")
print("  - glove.6B.300d.txt (300 dim, 400K vocab)")
print("\nFastText (Facebook):")
print("  - 157 languages, 300 dim")
print("  - Can handle OOV words via subwords!")
print("  - Download: https://fasttext.cc/docs/en/crawl-vectors.html")
print("\nFor Indonesian:")
print("  - fastText.id.300d.txt")
print("  - word2vec.id.300d.txt (from various sources)")

6. Traditional Classification Models

Before deep learning came to dominate, several traditional machine learning models were already very effective for text classification, especially when combined with TF-IDF features.
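To see why such simple models work on word-count features, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. The class name `TinyNaiveBayes` and the four training sentences are invented for illustration, and it operates on raw word counts rather than TF-IDF; in practice a library implementation such as scikit-learn's `MultinomialNB` is the better choice.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return {label: log P(label) + sum of log P(word | label)}."""
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_label in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_label / n_docs)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)

train_texts = [
    "produk bagus sekali",
    "pelayanan ramah dan cepat",
    "barang rusak parah",
    "pelayanan buruk dan lambat",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

nb = TinyNaiveBayes().fit(train_texts, train_labels)
print(nb.predict("produk bagus"))
print(nb.predict("barang rusak"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is missing from one class from driving that class's probability to zero.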

Python: Text Classification with Traditional ML
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# =============================================
# DATASET: Simulated Sentiment Analysis
# =============================================
# (In real projects, use a real dataset, e.g. from Kaggle.)
texts = [
    "Produk ini sangat bagus dan berkualitas tinggi",
    "Saya sangat puas dengan pembelian ini",
    "Barang cepat sampai dan sesuai deskripsi",
    "Kualitas sangat baik, recommended banget",
    "Pelayanan ramah dan pengiriman cepat",
    "Sangat senang dengan hasilnya, bagus sekali",
    "Makanan enak dan pelayanan memuaskan",
    "Gadget ini canggih dan berfungsi sempurna",
    "Harga sebanding dengan kualitas",
    "Akan beli lagi di toko ini, top!",
    "Produk sangat mengecewakan dan rusak",
    "Barang tidak sesuai foto, sangat buruk",
    "Pengiriman sangat lambat dan tidak aman",
    "Kualitas jelek, buang uang saja",
    "Pelayanan buruk, tidak ramah",
    "Barang palsu dan tidak berfungsi",
    "Sangat kecewa dengan produk ini",
    "Makanan tidak enak dan pelayanan lambat",
    "Harga mahal tapi kualitas rendah",
    "Tidak akan pernah beli di sini lagi"
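] * 25  # Repeat for more data. Caveat: the same sentences end up in both
        # train and test, so the accuracies below are optimistic.

labels = ([1]*10 + [0]*10) * 25  # 1=positive, 0=negative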

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# =============================================
# COMPARING CLASSIFIERS
# =============================================
classifiers = {
    'Naive Bayes': MultinomialNB(alpha=1.0),
    'Logistic Regression': LogisticRegression(max_iter=1000, C=1.0),
    'Linear SVM': LinearSVC(C=1.0, max_iter=5000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

print("\n" + "=" * 60)
print("COMPARISON: TF-IDF + CLASSIFIERS")
print("=" * 60)

results = {}
for name, clf in classifiers.items():
    # Pipeline: TF-IDF → Classifier
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
        ('clf', clf)
    ])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    cv_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='accuracy')
    
    results[name] = {
        'accuracy': acc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std()
    }
    
    print(f"\n{name}:")
    print(f"  Test Accuracy: {acc:.4f}")
    print(f"  CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Best model
best_name = max(results, key=lambda x: results[x]['cv_mean'])
print(f"\n{'='*60}")
print(f"BEST MODEL: {best_name} (CV: {results[best_name]['cv_mean']:.4f})")
print(f"{'='*60}")

# =============================================
# PREDICTION EXAMPLE
# =============================================
best_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('clf', classifiers[best_name])
])
best_pipeline.fit(X_train, y_train)

new_texts = [
    "Produk ini luar biasa bagus, saya sangat puas!",
    "Barang rusak dan pelayanan sangat buruk",
    "Biasa saja, tidak terlalu bagus tapi tidak jelek juga",
    "Pengiriman cepat dan packaging rapi, terima kasih!"
]

predictions = best_pipeline.predict(new_texts)
labels_map = {0: 'Negative', 1: 'Positive'}

print(f"\n=== Predictions (Best: {best_name}) ===")
for text, pred in zip(new_texts, predictions):
    print(f"  [{labels_map[pred]:>8}] {text}")
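One caveat about the example above: the 20 sentences are repeated 25 times, so an ordinary random split puts exact copies of every test sentence into the training set and the reported accuracy says little about generalization. A possible remedy, sketched here with a hypothetical helper `split_unique`, is to split on distinct texts first. This sketch is not stratified and assumes each text has a single consistent label.

```python
import random

def split_unique(texts, labels, test_size=0.2, seed=42):
    """Split on distinct texts so no test text also appears in training."""
    # dict() keeps one entry per distinct text (assumes one label per text).
    unique = list(dict(zip(texts, labels)).items())
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_size))
    # Returns (train, test), each a list of (text, label) pairs.
    return unique[n_test:], unique[:n_test]

demo_texts = ["barang bagus", "barang rusak", "pelayanan ramah",
              "pelayanan buruk", "pengiriman cepat"] * 10
demo_labels = [1, 0, 1, 0, 1] * 10

train, test = split_unique(demo_texts, demo_labels)
train_set = {text for text, _ in train}
test_set = {text for text, _ in test}
print(len(train), len(test), train_set & test_set)
```

With a real dataset the same idea applies to near-duplicates, for example reviews that differ only in whitespace or punctuation.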

7. Deep Learning for NLP

Evolution of NLP Models

Diagram: Evolution of NLP Models
Evolution of NLP models:

2003: Neural LM (Bengio)
  │   • Early learned word embeddings
  │
2013: Word2Vec (Mikolov)
  │   • Pre-trained word vectors → transfer learning
  │   • "king - man + woman ≈ queen"
  │
2014: seq2seq + Attention (Sutskever, Bahdanau)
  │   • Encoder-Decoder LSTM
  │   • Neural machine translation
  │
2018: ELMo (Peters)
  │   • Contextual word embeddings
  │   • Same word → different vectors per context
  │
2018: BERT (Google) ⭐
  │   • Bidirectional Transformer encoder
  │   • Pre-train → Fine-tune paradigm
  │   • SOTA on 11 NLP tasks at once
  │
2019: GPT-2 (OpenAI)
  │   • Autoregressive Transformer decoder
  │   • Generative language model
  │
2020: GPT-3 (OpenAI)
  │   • 175B parameters
  │   • In-context learning (few-shot)
  │
2022: ChatGPT / InstructGPT
  │   • RLHF alignment
  │   • Conversational AI
  │
2023-2025: GPT-4, LLaMA, Gemini, Claude
      • Multi-modal
      • Long context (100k+ tokens)
      • Reasoning capabilities
Python: LSTM Text Classifier with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import numpy as np

# =============================================
# 1. VOCABULARY BUILDER
# =============================================
class Vocabulary:
    def __init__(self, max_size=10000):
        # Index 0 is reserved for padding, index 1 for unknown words
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
        self.max_size = max_size
    
    def build(self, texts):
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())
        
        most_common = counter.most_common(self.max_size - 2)
        for idx, (word, count) in enumerate(most_common, start=2):
            self.word2idx[word] = idx
            self.idx2word[idx] = word
    
    def encode(self, text, max_len=50):
        tokens = text.lower().split()
        ids = [self.word2idx.get(t, 1) for t in tokens[:max_len]]
        # Pad
        ids += [0] * (max_len - len(ids))
        return ids
    
    def __len__(self):
        return len(self.word2idx)

# =============================================
# 2. DATASET
# =============================================
class TextDataset(Dataset):
    def __init__(self, texts, labels, vocab, max_len=50):
        self.texts = texts
        self.labels = labels
        self.vocab = vocab
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoded = self.vocab.encode(self.texts[idx], self.max_len)
        return torch.tensor(encoded, dtype=torch.long), torch.tensor(self.labels[idx], dtype=torch.long)

# =============================================
# 3. LSTM MODEL
# =============================================
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, 
                 num_layers=2, dropout=0.3, bidirectional=True):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            embed_dim, hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=bidirectional,
            dropout=dropout
        )
        self.dropout = nn.Dropout(dropout)
        
        # Bidirectional → hidden_dim * 2
        direction_factor = 2 if bidirectional else 1
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * direction_factor, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes)
        )
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        
        # Concatenate the final hidden states of both directions
        if self.lstm.bidirectional:
            hidden_cat = torch.cat([hidden[-2], hidden[-1]], dim=1)
        else:
            hidden_cat = hidden[-1]
        
        out = self.dropout(hidden_cat)
        out = self.fc(out)
        return out

# =============================================
# 4. TRAINING
# =============================================
# Sample data
texts = [
    "Produk sangat bagus dan berkualitas", "Saya sangat puas dengan layanan",
    "Barang rusak dan tidak sesuai", "Pelayanan sangat buruk dan lambat",
    "Makanan enak dan porsi besar", "Kualitas sangat memuaskan",
    "Barang palsu dan tidak original", "Sangat kecewa dengan pembelian",
    "Pengiriman cepat dan aman", "Harga terjangkau kualitas bagus",
] * 30

labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1] * 30

# Build vocabulary
vocab = Vocabulary(max_size=1000)
vocab.build(texts)

# Create datasets
train_dataset = TextDataset(texts[:200], labels[:200], vocab)
test_dataset = TextDataset(texts[200:], labels[200:], vocab)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMClassifier(
    vocab_size=len(vocab),
    embed_dim=64,
    hidden_dim=128,
    num_classes=2,
    num_layers=2,
    bidirectional=True
).to(device)

optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train loop (simplified)
print("Training LSTM Text Classifier...")
print(f"Vocab size: {len(vocab)}")
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

for epoch in range(10):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for texts_batch, labels_batch in train_loader:
        texts_batch = texts_batch.to(device)
        labels_batch = labels_batch.to(device)
        
        optimizer.zero_grad()
        outputs = model(texts_batch)
        loss = criterion(outputs, labels_batch)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total += labels_batch.size(0)
        correct += predicted.eq(labels_batch).sum().item()
    
    train_acc = 100. * correct / total
    
    # Test
    model.eval()
    test_correct = 0
    test_total = 0
    with torch.no_grad():
        for texts_batch, labels_batch in test_loader:
            texts_batch = texts_batch.to(device)
            labels_batch = labels_batch.to(device)
            outputs = model(texts_batch)
            _, predicted = outputs.max(1)
            test_total += labels_batch.size(0)
            test_correct += predicted.eq(labels_batch).sum().item()
    
    test_acc = 100. * test_correct / test_total
    print(f"Epoch [{epoch+1:2d}/10] Loss: {total_loss:.4f} "
          f"Train: {train_acc:.1f}% Test: {test_acc:.1f}%")

8. Fine-tuning BERT for Classification

Fine-tuning BERT is one of the strongest approaches to text classification. We take a BERT model that has already been pre-trained on a large corpus (Wikipedia, books) and continue training it (fine-tune) on our task-specific dataset.
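Fine-tuning usually pairs a small learning rate with a schedule that warms up linearly and then decays linearly to zero, which is what the `get_linear_schedule_with_warmup` call in the code below is for. The following plain-Python sketch shows the shape of that multiplier. The function name `linear_warmup_factor` is an assumption for illustration; it is meant to mirror the schedule's behaviour, not to reproduce the library's internals exactly.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 2e-5
total_steps, warmup_steps = 100, 10  # 10% warmup, as in the tutorial code

for step in (0, 5, 10, 55, 100):
    factor = linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

The warmup phase keeps early updates small while the freshly initialized classification head is still producing large, noisy gradients.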

Python: Fine-tuning BERT with Hugging Face
import torch
from torch.optim import AdamW  # transformers.AdamW is deprecated; use PyTorch's
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# =============================================
# 1. CUSTOM DATASET
# =============================================
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# =============================================
# 2. PREPARE DATA
# =============================================
texts = [
    "Produk ini sangat bagus dan berkualitas tinggi",
    "Saya sangat puas dengan pembelian ini",
    "Barang cepat sampai dan sesuai deskripsi",
    "Kualitas sangat baik, recommended",
    "Pelayanan ramah dan pengiriman cepat",
    "Sangat senang dengan hasilnya",
    "Produk sangat mengecewakan dan rusak",
    "Barang tidak sesuai foto, sangat buruk",
    "Pengiriman sangat lambat",
    "Kualitas jelek, buang uang saja",
    "Pelayanan buruk dan tidak ramah",
    "Barang palsu dan tidak berfungsi",
] * 30  # Repeat for enough data

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0] * 30

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# =============================================
# 3. LOAD BERT
# =============================================
MODEL_NAME = 'indobenchmark/indobert-base-p1'  # Indonesian BERT
# Alternative: 'bert-base-multilingual-cased' (multilingual)

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2
)

print(f"Model: {MODEL_NAME}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# Create DataLoaders
train_dataset = TextClassificationDataset(X_train, y_train, tokenizer, max_len=128)
val_dataset = TextClassificationDataset(X_val, y_val, tokenizer, max_len=128)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

# =============================================
# 4. FINE-TUNING
# =============================================
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Optimizer with weight decay
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Learning rate scheduler
num_epochs = 5
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=int(0.1 * total_steps),  # 10% warmup
    num_training_steps=total_steps
)

# Training loop
print("\nFine-tuning BERT...")
for epoch in range(num_epochs):
    # TRAIN
    model.train()
    total_loss = 0
    train_preds, train_true = [], []
    
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels_batch = batch['labels'].to(device)
        
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                       labels=labels_batch)
        loss = outputs.loss
        logits = outputs.logits
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        train_preds.extend(preds)
        train_true.extend(labels_batch.cpu().numpy())
    
    train_acc = accuracy_score(train_true, train_preds)
    avg_loss = total_loss / len(train_loader)
    
    # VALIDATION
    model.eval()
    val_preds, val_true = [], []
    
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels_batch = batch['labels'].to(device)
            
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
            
            val_preds.extend(preds)
            val_true.extend(labels_batch.cpu().numpy())
    
    val_acc = accuracy_score(val_true, val_preds)
    
    print(f"Epoch [{epoch+1}/{num_epochs}] "
          f"Loss: {avg_loss:.4f} | "
          f"Train Acc: {train_acc:.4f} | "
          f"Val Acc: {val_acc:.4f}")

# =============================================
# 5. INFERENCE
# =============================================
model.eval()
new_texts = [
    "Produk luar biasa bagus, sangat puas!",
    "Barang rusak dan pelayanan sangat buruk",
    "Cukup bagus untuk harga segitu"
]

print("\n=== Predictions ===")
for text in new_texts:
    encoding = tokenizer(text, return_tensors='pt', max_length=128,
                        padding='max_length', truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        probs = torch.softmax(outputs.logits, dim=1)
        pred = torch.argmax(probs, dim=1).item()
    
    label = "Positive" if pred == 1 else "Negative"
    conf = probs[0][pred].item()
    print(f"  [{label} ({conf:.2%})] {text}")
💡 Tips for Fine-tuning BERT
  • Learning rate matters most: use 2e-5 to 5e-5; larger values tend to damage the pre-trained weights
  • Small batch size: 16 or 32. If GPU memory runs short, use gradient accumulation
  • Epochs: 3-5 epochs are usually enough; training longer tends to overfit
  • Warmup: use about 10% of the total steps for warmup
  • Weight decay: 0.01 for regularization
  • Use an Indonesian BERT: indobenchmark/indobert-base-p1 for Indonesian text
  • Max length: 128 or 256 (match it to the typical length of your texts)
  • Mixed precision: torch.cuda.amp can speed up training by roughly 2-3× on supported GPUs
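
To make the warmup tip concrete, here is a minimal pure-Python sketch of a linear warmup followed by linear decay, the shape of schedule used in the fine-tuning loop. The function name `linear_warmup_decay` and the step counts are illustrative assumptions; this is not the library's own code, only the multiplier such a schedule applies to the base learning rate.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 100                       # hypothetical: 20 batches x 5 epochs
warmup_steps = int(0.1 * total_steps)   # 10% warmup

for step in (0, 5, 10, 55, 100):
    lr = base_lr * linear_warmup_decay(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The learning rate starts at zero, peaks at the base value when warmup ends, and returns to zero at the last step, so the largest updates happen only after the freshly initialized classification head has stabilized.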

9. Evaluation & Best Practices

Evaluation Metrics

Metric | Formula | When to Use
Accuracy | (TP + TN) / Total | Balanced datasets
Precision | TP / (TP + FP) | When false positives are costly (spam: do not block legitimate email)
Recall | TP / (TP + FN) | When false negatives are costly (disease detection: do not miss positive cases)
F1-Score | 2 × (P × R) / (P + R) | Imbalanced datasets; balances precision (P) and recall (R)
AUC-ROC | Area under the ROC curve | Threshold-independent evaluation
Confusion Matrix | TP, TN, FP, FN | Always: it shows which kinds of errors the model makes
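
The formulas in the table can be checked by hand on a small example. The sketch below uses only the standard library; the helper name `binary_metrics` and the ten labels are hypothetical, chosen so that the positive class is rare.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 2 positives out of 10, one of them missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, value in binary_metrics(y_true, y_pred).items():
    print(f"{name:>9}: {value:.3f}")
```

Accuracy comes out at 0.900 even though half of the positive cases are missed (recall 0.500, F1 0.667), which is why accuracy alone is a poor guide on imbalanced data.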

Error Analysis

Python: Error Analysis
from sklearn.metrics import confusion_matrix, classification_report

# Simulated predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]
texts_test = [
    "Produk sangat bagus",        # True: Pos, Pred: Pos ✅
    "Saya puas sekali",            # True: Pos, Pred: Pos ✅
    "Cukup lumayan lah",           # True: Pos, Pred: Neg ❌ (edge case!)
    "Barang rusak parah",          # True: Neg, Pred: Neg ✅
    "Tidak sesuai ekspektasi",     # True: Neg, Pred: Neg ✅
    "Biasa saja sih",              # True: Neg, Pred: Pos ❌ (edge case!)
    "Luar biasa bagusnya",        # True: Pos, Pred: Pos ✅
    "Kurang memuaskan",            # True: Pos, Pred: Neg ❌
    "Pelayanan sangat buruk",     # True: Neg, Pred: Neg ✅
    "Lumayan bagus kok",          # True: Neg, Pred: Pos ❌ (edge case!)
]

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(f"             Pred Neg  Pred Pos")
print(f"  True Neg:  {cm[0][0]:>6}    {cm[0][1]:>6}")
print(f"  True Pos:  {cm[1][0]:>6}    {cm[1][1]:>6}")

print(f"\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

# Error analysis: collect the misclassified examples
print("\n=== ERROR ANALYSIS ===")
errors = []
for true, pred, text in zip(y_true, y_pred, texts_test):
    if true != pred:
        errors.append((text, true, pred))

print(f"Total errors: {len(errors)}/{len(y_true)}")
print(f"Error types:")
for text, true, pred in errors:
    label_true = "Pos" if true == 1 else "Neg"
    label_pred = "Pos" if pred == 1 else "Neg"
    print(f"  ❌ '{text}'")
    print(f"     True: {label_true} | Predicted: {label_pred}")

# Error pattern analysis
print("\n=== ERROR PATTERN ANALYSIS ===")
print("Common patterns in errors:")
print("  - Ambiguous/soft phrases: 'cukup lumayan', 'biasa saja'")
print("  - Complex negation: 'kurang memuaskan'")
print("  - Sarcastic undertone: 'lumayan bagus kok'")
print("\nAction items:")
print("  1. Add training data for the edge cases")
print("  2. Data augmentation: paraphrasing")
print("  3. Try a larger model (e.g. BERT instead of Logistic Regression)")
print("  4. Ensemble several models")
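
A common next step in error analysis is to split the mistakes by direction, since false positives and false negatives often have different causes. The following is a minimal sketch under that assumption; the helper `group_errors` and the four sample reviews are illustrative only.

```python
from collections import defaultdict


def group_errors(texts, y_true, y_pred):
    """Bucket misclassified texts into false positives and false negatives."""
    groups = defaultdict(list)
    for text, true, pred in zip(texts, y_true, y_pred):
        if true == pred:
            continue  # correct prediction, nothing to inspect
        kind = "false_positive" if pred == 1 else "false_negative"
        groups[kind].append(text)
    return dict(groups)


# Hypothetical sample: 1 = positive, 0 = negative
texts = ["Cukup lumayan lah", "Biasa saja sih", "Barang rusak parah", "Lumayan bagus kok"]
y_true = [1, 0, 0, 0]
y_pred = [0, 1, 0, 1]

for kind, items in group_errors(texts, y_true, y_pred).items():
    print(f"{kind}: {len(items)}")
    for item in items:
        print(f"  - {item}")
```

Reading each bucket separately makes it easier to decide whether the fix is more data for lukewarm positive reviews or better handling of mild negative phrasing.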

Text Classification Best Practices

💡 Best Practices
  1. Start simple: use TF-IDF + Logistic Regression as a baseline. It is often already a strong one
  2. Data quality > model complexity: clean data and accurate labels matter more than a sophisticated model
  3. Handle class imbalance: oversampling (e.g. SMOTE on vectorized features), undersampling, or class_weight
  4. Data augmentation: paraphrasing, back-translation, synonym replacement
  5. Cross-validation: do not evaluate on a single split only
  6. Error analysis: always look at where the model fails. Error patterns show what to improve next
  7. BERT for Indonesian text: use indobenchmark/indobert-base-p1 or bert-base-multilingual-cased
  8. Ensemble: combine several models (voting/stacking) for better performance
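
For the class imbalance item, one widely used heuristic weights each class by n_samples / (n_classes × count), which is the formula scikit-learn documents for class_weight='balanced'. The sketch below reproduces it in plain Python; the function name `balanced_class_weights` and the 90/10 label split are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 90 negative (0) and 10 positive (1) examples.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight = {weights[cls]:.3f}")
```

With these weights a mistake on the rare class costs about nine times as much as a mistake on the common one, so the model is no longer rewarded for always predicting the majority class.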

10. Quiz: Test Your Understanding!

After reading the tutorial above, answer the following 5 questions to test your understanding of NLP Text Classification:

Question 1: What is the main advantage of subword tokenization (BPE/WordPiece) over word-level tokenization?

a) It produces larger vectors
b) It reduces the Out-of-Vocabulary (OOV) problem by splitting unknown words into subwords
c) It requires no training
d) It can only be used for English

Question 2: Why is TF-IDF better than Bag-of-Words (BoW)?

a) TF-IDF counts word frequency
b) TF-IDF accounts for how important a word is relative to the whole corpus
c) TF-IDF captures word order
d) TF-IDF uses a neural network

Question 3: What is the difference between BERT and GPT in the context of text classification?

a) BERT and GPT are identical for all tasks
b) BERT (encoder-only, bidirectional) is better suited to understanding; GPT (decoder-only, autoregressive) is better suited to generation
c) GPT is always better than BERT
d) BERT cannot be fine-tuned for classification

Question 4: The recommended learning rate for fine-tuning BERT is...

a) 0.01 (large, for fast convergence)
b) 2e-5 to 5e-5 (very small, so the pre-trained weights are not damaged)
c) 0.1 (very large)
d) 1.0 (no tuning needed)

Question 5: Which metric is best suited to text classification on a highly imbalanced dataset?

a) Accuracy
b) F1-Score
c) Mean Squared Error
d) R² Score