Vector Databases: Pinecone — Embeddings, Similarity Search & RAG

Learn vector databases and Pinecone from the ground up: embeddings, semantic search, similarity search, RAG pipelines, and LLM integration for modern AI applications.

1. Introduction to Vector Databases

A vector database is a specialized database that stores and manages data as vectors (high-dimensional arrays of numbers). Unlike a traditional database, which looks up data by exact keyword match, a vector database retrieves data by semantic similarity, that is, by closeness in meaning.

Suppose you search for "sepatu olahraga" (sports shoes) in a traditional database: you only get documents that contain the words "sepatu olahraga". In a vector database, the same query can also surface documents about "running shoes", "sneakers untuk jogging" (sneakers for jogging), or "footwear aktivitas fisik" (footwear for physical activity), because all of these are close in meaning.

Diagram: Vector Database Concept
┌─────────────────────────────────────────────────────────────────┐
│                     VECTOR DATABASE CONCEPT                     │
│                                                                 │
│  Input Text → Embedding Model → Vector (numbers) → Store in DB  │
│                                                                 │
│  "sepatu olahraga"  →  [0.23, -0.45, 0.87, ..., 0.12]  (384D)   │
│  "running shoes"    →  [0.25, -0.42, 0.85, ..., 0.15]  (384D)   │
│  "sneakers joging"  →  [0.21, -0.48, 0.82, ..., 0.10]  (384D)   │
│  "mobil sport"      →  [-0.67, 0.34, -0.21, ..., 0.78] (384D)   │
│                                                                 │
│  Semantic search: "sepatu lari" → query vector                  │
│                                                                 │
│  Results (by cosine similarity):                                │
│  1. "running shoes"     → 0.98 (very similar) ✅                │
│  2. "sepatu olahraga"   → 0.95 (similar) ✅                     │
│  3. "sneakers joging"   → 0.93 (similar) ✅                     │
│  4. "mobil sport"       → 0.12 (not similar) ❌                 │
│                                                                 │
│  Texts with similar meaning get nearby vectors!                 │
└─────────────────────────────────────────────────────────────────┘
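
The ranking idea in the diagram can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors below are made-up stand-ins (real embedding models produce hundreds of dimensions), and `cosine_similarity` is a helper written for this sketch rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the numbers are invented for illustration.
toy_vectors = {
    "running shoes": [0.9, 0.1, 0.0],
    "sepatu olahraga": [0.8, 0.2, 0.1],
    "mobil sport": [0.1, 0.9, 0.3],
}
query_vector = [0.85, 0.15, 0.05]  # stand-in for the embedding of "sepatu lari"

# Rank stored texts from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
for name in ranked:
    score = cosine_similarity(query_vector, toy_vectors[name])
    print(f"{name}: {score:.3f}")
```

The two shoe-related entries score close to 1, while "mobil sport" lands far lower, which is the behavior a vector database relies on at much larger scale.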

Why Do Vector Databases Matter?

Use Case                             | Example Application      | Why a Vector DB?
-------------------------------------|--------------------------|--------------------------------------------
Semantic search                      | Smart search engine      | Search by meaning, not by keyword
RAG (Retrieval Augmented Generation) | ChatGPT + knowledge base | Lets an LLM answer questions from your data
Recommendation                       | Similar products         | Find items with similar embeddings
Image search                         | Visual image lookup      | Image → vector → similarity search
Anomaly detection                    | Fraud detection          | Outliers have vectors far from any cluster
Clustering                           | Automatic grouping       | Group data by semantic similarity

What Is Pinecone?

Pinecone is a managed (fully hosted) vector database and one of the most widely used. Its main advantages: no servers to set up, automatic scaling, low query latency (often under 50 ms, depending on index size, region, and load), and straightforward integration with the AI/ML ecosystem, including OpenAI, LangChain, and LlamaIndex.

2. Embeddings — Representing Data as Vectors

Embedding is the process of converting data (text, images, audio) into a high-dimensional numeric vector. The vector represents the "meaning" or "features" of the data in a mathematical space, so that similar items end up close together.

Popular Embedding Models

Model                  | Dimensions | Provider              | Best For
-----------------------|------------|-----------------------|------------------------------------
text-embedding-3-small | 1536       | OpenAI                | General text, cost-effective
text-embedding-3-large | 3072       | OpenAI                | Text, higher accuracy
all-MiniLM-L6-v2       | 384        | Sentence Transformers | Free, fast, runs locally
multilingual-e5-large  | 1024       | Microsoft             | Multilingual (including Indonesian)
embed-english-v3.0     | 1024       | Cohere                | English text
gecko                  | 768        | Google                | Multi-purpose
Python — Creating Embeddings
# =============================================
# EMBEDDINGS with OpenAI
# =============================================
# pip install openai

import openai

client = openai.OpenAI(api_key="sk-...")

# Create an embedding for a single text
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Mesin ini adalah panduan lengkap belajar Python untuk pemula"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 elements: {vector[:5]}")
# Output looks like (values are illustrative): [0.0234, -0.0456, 0.0789, -0.0123, 0.0567]

# Create embeddings for several texts at once (batch)
texts = [
    "Cara belajar Python untuk pemula",
    "Tutorial JavaScript dasar",
    "Panduan database MySQL",
    "Resep nasi goreng spesial",
    "Jadwal pertandingan sepak bola"
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

embeddings = [item.embedding for item in response.data]
print(f"Number of embeddings: {len(embeddings)}")  # 5


# =============================================
# EMBEDDINGS with Sentence Transformers (FREE, LOCAL)
# =============================================
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed a single text
vector = model.encode("Belajar Python dari nol")
print(f"Dimensions: {len(vector)}")  # 384

# Embed a batch
texts = [
    "Cara belajar Python untuk pemula",
    "Tutorial JavaScript dasar",
    "Panduan database MySQL",
    "Resep nasi goreng spesial"
]
vectors = model.encode(texts)
print(f"Shape: {vectors.shape}")  # (4, 384)


# =============================================
# MULTILINGUAL EMBEDDINGS (for Indonesian text)
# =============================================
model_multi = SentenceTransformer('intfloat/multilingual-e5-large')

# Embed an Indonesian query (E5 models expect a "query: " or "passage: " prefix)
vector_id = model_multi.encode("query: Apa itu machine learning?")
print(f"Dimensions: {len(vector_id)}")  # 1024
💡 Embedding Tips
  • Model consistency: use the SAME model for indexing and querying
  • Indonesian text: use a multilingual model (multilingual-e5, BGE-M3)
  • Chunking: split long documents into chunks of 200-500 tokens before embedding
  • Query prefixes: some models expect a "query:" or "passage:" prefix
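
The prefix tip is easy to get wrong when indexing and querying live in different parts of a codebase. A minimal sketch of one way to centralize it follows; `with_e5_prefix` is a hypothetical helper name, and the "query: " / "passage: " convention is the one used by E5-style models, so check the model card of whichever model you pick.

```python
def with_e5_prefix(text, kind):
    """Prepend the role prefix that E5-style embedding models expect.

    kind must be "query" (search-time text) or "passage" (indexed text).
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"


# Documents get the passage prefix at indexing time...
passages = [
    with_e5_prefix(text, "passage")
    for text in ["Machine learning adalah cabang dari AI.", "Resep nasi goreng spesial"]
]
# ...and the user's question gets the query prefix at search time.
query = with_e5_prefix("Apa itu machine learning?", "query")

print(passages)
print(query)
```

The prefixed strings are what you would pass to the model's `encode` call, in place of the raw text.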

3. Setup Pinecone

Python — Setup Pinecone
# =============================================
# STEP 1: Install
# =============================================
# pip install pinecone

# =============================================
# STEP 2: Initialize Pinecone
# =============================================
from pinecone import Pinecone, ServerlessSpec

# Initialize the client
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# =============================================
# STEP 3: Create an index (the database that holds your vectors)
# =============================================

# Check which indexes already exist
existing_indexes = pc.list_indexes().names()
print(f"Existing indexes: {existing_indexes}")

# Create a new index if it does not exist yet
if 'tutorial-index' not in existing_indexes:
    pc.create_index(
        name='tutorial-index',
        dimension=1536,       # Must match the embedding model's output dimension
        metric='cosine',      # cosine, euclidean, dotproduct
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )
    print("Index 'tutorial-index' created!")

# Connect to the index
index = pc.Index('tutorial-index')

# Check index statistics
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimension: {stats.dimension}")
print(f"Namespaces: {list(stats.namespaces.keys())}")


# =============================================
# DISTANCE METRIC: Choose the one that fits
# =============================================
# cosine   → Good for semantic search (the most common choice)
#            Measures the angle between 2 vectors
#            (range -1 to 1, 1 = same direction)
#
# euclidean → Measures the straight-line distance between 2 points
#             Good for numeric/spatial data
#
# dotproduct → Like cosine but without length normalization
#              Good for vectors that are already normalized
Diagram: Similarity Metrics
┌─────────────────────────────────────────────────────────────────┐
│                       SIMILARITY METRICS                        │
│                                                                 │
│  COSINE SIMILARITY                                              │
│  ─────────────────                                              │
│  Measures the ANGLE between two vectors                         │
│                                                                 │
│     v1 ●────────●  v2                                           │
│         \θ      /     cos(θ) = 1 → identical                    │
│          \    /        cos(θ) = 0 → orthogonal (unrelated)      │
│           \  /         cos(θ) = -1 → opposite                   │
│            \/                                                   │
│                                                                 │
│  Range: [-1, 1], higher = more similar                          │
│  Best for: text embeddings, semantic search                     │
│                                                                 │
│  EUCLIDEAN DISTANCE                                             │
│  ──────────────────                                             │
│  Measures the STRAIGHT-LINE DISTANCE between two points         │
│                                                                 │
│     (1,4) ●              d = √(Σ(a-b)²)                         │
│              \             d = 0 → identical                    │
│               \            d small → similar                    │
│                ● (4,1)     d large → different                  │
│                                                                 │
│  Range: [0, ∞), smaller = more similar                          │
│  Best for: spatial data, numerical features                     │
└─────────────────────────────────────────────────────────────────┘
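
To make the relationship between the three metrics concrete, here is a small pure-Python sketch using the (1,4) and (4,1) points from the diagram. The helper names are written for this example only. It checks two facts: scaling a vector changes the dot product but not the cosine, and for unit-length vectors the dot product equals the cosine while the squared Euclidean distance equals 2 - 2·cos(θ).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = [1.0, 4.0], [4.0, 1.0]
scaled_a = [2.0, 8.0]  # same direction as a, twice the length

# Cosine ignores magnitude; dot product does not.
print("cosine:", cosine(a, b), cosine(scaled_a, b))
print("dot:   ", dot(a, b), dot(scaled_a, b))

# After normalization the metrics agree on ranking.
ua, ub = normalize(a), normalize(b)
print("unit dot vs cosine:", dot(ua, ub), cosine(ua, ub))
print("d^2 vs 2 - 2cos:   ", euclidean(ua, ub) ** 2, 2 - 2 * cosine(ua, ub))
```

This is why the metric choice matters mostly for unnormalized vectors: once everything is unit length, cosine, dot product, and Euclidean distance all produce the same ordering of neighbors.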

4. CRUD Operations — Upsert, Query, Delete

Python — Pinecone CRUD Operations
# =============================================
# UPSERT: Store vectors (create or update)
# =============================================

# Format: a list of dicts with "id", "values", and optional "metadata"
vectors_to_upsert = [
    {
        "id": "doc_001",
        "values": [0.023, -0.045, 0.078, 0.012, 0.056],  # truncated for display; a real vector needs all 1536 values
        "metadata": {
            "title": "Tutorial Python Pemula",
            "category": "programming",
            "language": "id",
            "source": "beebanelabs.com",
            "year": 2026,
            "chunk_text": "Python adalah bahasa pemrograman serbaguna..."
        }
    },
    {
        "id": "doc_002",
        "values": [0.025, -0.042, 0.085, 0.015, 0.050],
        "metadata": {
            "title": "Belajar JavaScript untuk Pemula",
            "category": "programming",
            "language": "id",
            "source": "beebanelabs.com",
            "year": 2026,
            "chunk_text": "JavaScript adalah bahasa pemrograman web..."
        }
    },
    {
        "id": "doc_003",
        "values": [-0.067, 0.034, -0.021, 0.078, 0.091],
        "metadata": {
            "title": "Resep Nasi Goreng Spesial",
            "category": "cooking",
            "language": "id",
            "source": "resepmama.com",
            "year": 2025,
            "chunk_text": "Nasi goreng adalah makanan khas Indonesia..."
        }
    }
]

# Upsert into the index
index.upsert(vectors=vectors_to_upsert)
print(f"Upserted {len(vectors_to_upsert)} vectors")

# Check the statistics after the upsert
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")


# =============================================
# UPSERT with a NAMESPACE (partitioning data)
# =============================================

# Namespaces keep groups of records separate within the same index
index.upsert(
    vectors=[
        {"id": "art_001", "values": [...], "metadata": {"title": "..."}}
    ],
    namespace="articles"  # Namespace for articles
)

index.upsert(
    vectors=[
        {"id": "prod_001", "values": [...], "metadata": {"title": "..."}}
    ],
    namespace="products"  # Namespace for products
)


# =============================================
# QUERY: Find similar vectors
# =============================================

# Create an embedding for the query
query_text = "cara belajar coding untuk pemula"
query_vector = get_embedding(query_text)  # Your embedding function (one is defined in the RAG section)

# Find the 3 most similar vectors
results = index.query(
    vector=query_vector,
    top_k=3,
    include_metadata=True
)

# Show the results
for match in results.matches:
    print(f"ID: {match.id}")
    print(f"Score: {match.score:.4f}")  # Similarity score
    print(f"Title: {match.metadata.get('title')}")
    print(f"Category: {match.metadata.get('category')}")
    print(f"Text: {match.metadata.get('chunk_text', '')[:100]}...")
    print("---")


# =============================================
# QUERY with a FILTER
# =============================================

# Filter on metadata (multiple fields are combined with AND)
filtered_results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True,
    filter={
        "category": {"$eq": "programming"},  # programming only
        "year": {"$gte": 2025}               # year >= 2025
    }
)

# Filter operators:
# $eq    - equal to
# $ne    - not equal to
# $gt    - greater than
# $gte   - greater than or equal to
# $lt    - less than
# $lte   - less than or equal to
# $in    - in the given list
# $nin   - not in the given list


# =============================================
# FETCH: Get vectors by ID
# =============================================
fetched = index.fetch(ids=["doc_001", "doc_002"])
for vid, vector_data in fetched.vectors.items():
    print(f"ID: {vid}")
    print(f"Metadata: {vector_data.metadata}")


# =============================================
# UPDATE: Change a vector's metadata
# =============================================
index.update(
    id="doc_001",
    set_metadata={"views": 1500, "updated": True}
)


# =============================================
# DELETE: Remove vectors
# =============================================

# Delete by ID
index.delete(ids=["doc_003"])

# Delete everything in a namespace
index.delete(delete_all=True, namespace="articles")

# Delete by metadata filter (support has varied by index type,
# so confirm it is available for your index)
index.delete(filter={"category": {"$eq": "deprecated"}})
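
To build intuition for the filter operators listed above, the sketch below evaluates the same kind of filter dictionary against a plain metadata dict, locally and without any Pinecone call. It is an illustration of the semantics described in the comments, not Pinecone's implementation: `matches_filter` is a hypothetical helper, and treating a missing metadata field as "no match" is an assumption made for this example.

```python
import operator

# Map each operator name to a function (metadata_value, operand) -> bool.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches_filter(metadata, filter_dict):
    """Return True if metadata satisfies every condition (implicit AND)."""
    for field, condition in filter_dict.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # shorthand: {"field": value}
        for op_name, operand in condition.items():
            if op_name not in OPERATORS:
                raise ValueError(f"unsupported operator: {op_name}")
            if not OPERATORS[op_name](metadata[field], operand):
                return False
    return True


doc = {"category": "programming", "year": 2026, "language": "id"}
print(matches_filter(doc, {"category": {"$eq": "programming"}, "year": {"$gte": 2025}}))
print(matches_filter(doc, {"category": {"$in": ["cooking", "travel"]}}))
```

A local checker like this can be handy in unit tests, where you want to verify that the filters your application builds select the records you expect before sending them to the index.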

5. Similarity Search & Filtering

Python — Advanced Similarity Search
# =============================================
# SEMANTIC SEARCH: Search by meaning
# =============================================

def semantic_search(query, top_k=5, filter_dict=None):
    """Semantic search with an optional metadata filter."""
    query_vector = get_embedding(query)

    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict
    )

    return [
        {
            "id": match.id,
            "score": match.score,
            "title": match.metadata.get("title", ""),
            "text": match.metadata.get("chunk_text", ""),
            "category": match.metadata.get("category", ""),
            "metadata": match.metadata
        }
        for match in results.matches
    ]

# Example usage:
# "machine learning" → finds articles about ML, AI, deep learning
results = semantic_search("machine learning untuk pemula")
for r in results:
    print(f"[{r['score']:.3f}] {r['title']}")

# With a filter:
results = semantic_search(
    "tutorial coding",
    filter_dict={"category": "programming", "language": "id"}
)


# =============================================
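# FILTERED SEARCH: Vector similarity + metadata filter
# (Note: "hybrid search" usually means combining dense and sparse
# vectors; this example combines vector search with metadata filters.)
# =============================================

# Find programming articles, restricted to recent ones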
results = index.query(
    vector=get_embedding("framework web terbaik"),
    top_k=10,
    include_metadata=True,
    filter={
        "category": {"$in": ["programming", "web-dev"]},
        "year": {"$gte": 2025},
        "language": {"$eq": "id"}
    }
)


# =============================================
# MULTI-QUERY: Combine several queries
# =============================================

import numpy as np

def multi_query_search(queries, weights=None, top_k=5):
    """Blend several queries into a single query vector."""
    if weights is None:
        weights = [1.0 / len(queries)] * len(queries)

    # Create an embedding for each query
    vectors = [get_embedding(q) for q in queries]

    # Weighted average (explicit float dtype so += never truncates)
    combined = np.zeros(len(vectors[0]), dtype=float)
    for vec, weight in zip(vectors, weights):
        combined += np.array(vec) * weight

    # Normalize to unit length
    combined = combined / np.linalg.norm(combined)

    return index.query(
        vector=combined.tolist(),
        top_k=top_k,
        include_metadata=True
    )

# Example: find content related to both "Python" and "data science"
# (a weighted blend of the two meanings, not a strict logical AND)
results = multi_query_search(
    ["Python programming", "data science tutorial"],
    weights=[0.6, 0.4]  # "Python" is weighted more heavily
)


# =============================================
# SEARCH + RE-RANKING
# =============================================

def search_with_rerank(query, top_k=20, final_k=5):
    """Fetch many candidates, then re-rank them."""
    # Step 1: Get candidates from vector search
    candidates = index.query(
        vector=get_embedding(query),
        top_k=top_k,
        include_metadata=True
    )

    # Step 2: Re-rank by relevance. Real systems often use a cross-encoder
    # or an LLM here; keyword overlap is a simple stand-in.
    texts = [m.metadata.get("chunk_text", "") for m in candidates.matches]

    query_words = set(query.lower().split())
    scored = []
    for match, text in zip(candidates.matches, texts):
        text_words = set(text.lower().split())
        overlap = len(query_words & text_words)
        # max(..., 1) avoids division by zero for an empty query
        overlap_ratio = overlap / max(len(query_words), 1)
        final_score = match.score * 0.7 + overlap_ratio * 0.3
        scored.append((match, final_score))

    # Highest blended score first
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:final_k]
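
The blended score used for re-ranking can be tried offline, without an index. The sketch below reproduces the 70% vector score / 30% keyword overlap formula on two made-up candidates; `blended_score` and the `SimpleNamespace` objects are stand-ins invented for this example, not Pinecone types.

```python
from types import SimpleNamespace


def blended_score(query, text, vector_score, vector_weight=0.7):
    """Mix a vector similarity score with the share of query words found in text."""
    query_words = set(query.lower().split())
    if not query_words:
        return vector_score * vector_weight
    overlap = len(query_words & set(text.lower().split())) / len(query_words)
    return vector_score * vector_weight + overlap * (1 - vector_weight)


# Hypothetical candidates: "b" has the higher vector score,
# but "a" actually contains the query words.
candidates = [
    SimpleNamespace(id="a", score=0.80, text="Tutorial Python untuk pemula"),
    SimpleNamespace(id="b", score=0.82, text="Resep nasi goreng"),
]
query = "python pemula"

reranked = sorted(
    candidates,
    key=lambda m: blended_score(query, m.text, m.score),
    reverse=True,
)
print([m.id for m in reranked])  # ['a', 'b']
```

Here the keyword signal lifts "a" above "b" even though its raw vector score is slightly lower, which is the effect a re-ranking stage is meant to have.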

6. RAG — Retrieval Augmented Generation

RAG (Retrieval Augmented Generation) is a technique that combines retrieval (fetching relevant data from a vector database) with generation (an LLM producing the answer). It lets an LLM answer questions based on your own data, without any fine-tuning.

Diagram: RAG Pipeline
┌─────────────────────────────────────────────────────────────────┐
│                          RAG PIPELINE                           │
│                                                                 │
│  User query: "How much is the BeebaneLabs Enterprise plan?"     │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 1: RETRIEVAL                  │                        │
│  │  Query → Embedding → Vector Search  │                        │
│  │  in Pinecone                        │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  Top-K results (relevant documents):                            │
│  • "Enterprise plan: Rp 5 million/month..."  (score: 0.95)      │
│  • "Enterprise features include..."          (score: 0.89)      │
│  • "Plan comparison..."                      (score: 0.82)      │
│                     │                                           │
│                     ▼                                           │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 2: AUGMENTATION               │                        │
│  │  Combine: system prompt +           │                        │
│  │  retrieved context +                │                        │
│  │  user query                         │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 3: GENERATION                 │                        │
│  │  LLM (GPT-4, Claude, etc.)          │                        │
│  │  generates an answer from           │                        │
│  │  the supplied context               │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  "The BeebaneLabs Enterprise plan costs                         │
│   Rp 5,000,000 per month. It includes                           │
│   unlimited users, priority support,                            │
│   and a custom domain."                                         │
│                                                                 │
│  ✅ Accurate answers from your data!                            │
│  ✅ No fine-tuning required!                                    │
│  ✅ Can cite its sources!                                       │
└─────────────────────────────────────────────────────────────────┘
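
The augmentation step (Step 2) is plain string assembly, so it can be sketched and tested without calling any API. The function below is one possible shape, under the assumption that the LLM accepts chat-style messages with "system" and "user" roles; `build_rag_messages` and the default system prompt are hypothetical names and wording chosen for this example.

```python
def build_rag_messages(query, contexts,
                       system_prompt="Answer only from the provided context."):
    """Combine system prompt, retrieved chunks, and the user query into chat messages."""
    # Number each chunk so the model (and the reader) can refer to it.
    numbered = [f"[{i}] {text}" for i, text in enumerate(contexts, start=1)]
    context_block = "\n\n".join(numbered)
    user_prompt = f"Context:\n{context_block}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


messages = build_rag_messages(
    "How much is the Enterprise plan?",
    ["Enterprise plan: Rp 5 million/month", "Enterprise features include priority support"],
)
for message in messages:
    print(f"--- {message['role']} ---")
    print(message["content"])
```

Keeping this step as a pure function makes it easy to inspect exactly what the model will see, which is often the quickest way to debug a RAG answer that ignores the retrieved context.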
Python — Full RAG Pipeline
# =============================================
# FULL RAG PIPELINE
# =============================================
# pip install pinecone openai

import openai
from pinecone import Pinecone

# Setup
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("knowledge-base")
openai_client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")


def get_embedding(text, model="text-embedding-3-small"):
    """Create an embedding for a piece of text."""
    response = openai_client.embeddings.create(
        model=model, input=text
    )
    return response.data[0].embedding


def retrieve_context(query, top_k=5, namespace=None):
    """Fetch relevant context from Pinecone."""
    query_vector = get_embedding(query)

    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        namespace=namespace
    )

    contexts = []
    sources = []
    for match in results.matches:
        # Similarity threshold. Typical scores differ between embedding
        # models, so tune this value on your own data.
        if match.score > 0.7:
            contexts.append(match.metadata.get("chunk_text", ""))
            sources.append({
                "id": match.id,
                "title": match.metadata.get("title", ""),
                "score": match.score
            })

    return contexts, sources


def generate_answer(query, contexts, sources):
    """Generate an answer with the LLM."""
    # Label each chunk with its source title so the model can cite it
    context_text = "\n\n---\n\n".join(
        f"[Source: {source['title']}]\n{context}"
        for context, source in zip(contexts, sources)
    )

    system_prompt = """You are an AI assistant that answers questions
based on the context provided.

Rules:
1. Answer ONLY from the context provided
2. If the information is not in the context, say "I could not find
   that information in our database"
3. Give clear, informative answers
4. Cite the source when relevant"""

    user_prompt = f"""Context from the database:
---
{context_text}
---

User question: {query}

Answer the question based on the context above."""

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.3,  # Lower = more deterministic, less free-form
        max_tokens=1000
    )

    return response.choices[0].message.content


def rag_pipeline(query, namespace=None):
    """Full RAG pipeline"""
    # Step 1: Retrieve
    contexts, sources = retrieve_context(query, namespace=namespace)

    if not contexts:
        return "Sorry, I could not find any relevant information.", []

    # Step 2 + 3: Augment + Generate
    answer = generate_answer(query, contexts, sources)

    return answer, sources


# =============================================
# EXAMPLE USAGE
# =============================================

# A question about your own data
query = "Bagaimana cara menginstall Python di Windows?"
answer, sources = rag_pipeline(query)

print("Answer:", answer)
print("\nSources:")
for s in sources:
    print(f"  - {s['title']} (score: {s['score']:.3f})")


# =============================================
# INGEST DATA: Load documents into Pinecone
# =============================================

def ingest_document(doc_id, text, metadata, chunk_size=500):
    """Split a document into chunks and upsert them into Pinecone."""
    # Step 1: Chunking (split long text).
    # Note: chunk_size counts whitespace-separated words, which only
    # approximates tokens, and consecutive chunks do not overlap.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    # Step 2: Embed each chunk and build the records
    vectors = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        chunk_id = f"{doc_id}_chunk_{i}"

        vectors.append({
            "id": chunk_id,
            "values": embedding,
            "metadata": {
                **metadata,
                "chunk_text": chunk,
                "chunk_index": i,
                "parent_doc_id": doc_id
            }
        })

    # Step 3: Upsert in batches. Pinecone allows at most 1000 records per
    # upsert; 100 is a conservative size that keeps requests small.
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

    print(f"Ingested {len(chunks)} chunks for doc {doc_id}")


# Example ingest:
ingest_document(
    doc_id="tutorial_python",
    text="Python adalah bahasa pemrograman serbaguna yang diciptakan oleh Guido van Rossum...",
    metadata={
        "title": "Tutorial Python Lengkap",
        "category": "programming",
        "source": "beebanelabs.com"
    }
)
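
The `ingest_document` function above splits text into back-to-back chunks, so a sentence that straddles a boundary is cut in two. A common refinement is to let consecutive chunks share some words. The sketch below shows one way to do that; `chunk_words` is a hypothetical helper, and like the original it counts whitespace-separated words as a rough stand-in for tokens.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word chunks where neighbors share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(n) for n in range(10))  # "0 1 2 ... 9"
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Each chunk starts with the last word of the previous one, so context that spans a boundary appears in full in at least one chunk. Larger overlaps improve recall at the cost of storing and embedding more text.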

7. Use Cases: Semantic Search & Recommendation

Python — Use Case Implementations
# =============================================
# USE CASE 1: Semantic Search Engine
# =============================================
import os

from pinecone import Pinecone

# Relies on get_embedding() from the RAG pipeline section

class SemanticSearchEngine:
    def __init__(self, index_name, embedding_model="text-embedding-3-small"):
        pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
        self.index = pc.Index(index_name)
        self.model = embedding_model

    def search(self, query, filters=None, top_k=10):
        vector = get_embedding(query, self.model)
        return self.index.query(
            vector=vector,
            top_k=top_k,
            include_metadata=True,
            filter=filters
        )

    def search_with_threshold(self, query, threshold=0.75, **kwargs):
        results = self.search(query, **kwargs)
        return [m for m in results.matches if m.score >= threshold]

# Usage:
engine = SemanticSearchEngine("knowledge-base")
results = engine.search("bagaimana cara deploy aplikasi ke cloud")


# =============================================
# USE CASE 2: Product Recommendation
# =============================================

def recommend_products(product_id, top_k=5):
    """Recommend similar products."""
    # Fetch the vector of the product currently being viewed
    product = index.fetch(ids=[product_id])
    product_vector = product.vectors[product_id].values

    # Find products with similar embeddings
    similar = index.query(
        vector=product_vector,
        top_k=top_k + 1,  # +1 because the product itself is also returned
        include_metadata=True,
        filter={"in_stock": {"$eq": True}}
    )

    # Exclude the product being viewed
    return [m for m in similar.matches if m.id != product_id][:top_k]


# =============================================
# USE CASE 3: Duplicate Detection
# =============================================

def find_duplicates(texts, threshold=0.95):
    """Find texts that are nearly identical to records already in the index."""
    embeddings = [get_embedding(t) for t in texts]
    duplicates = []

    for i in range(len(texts)):
        results = index.query(
            vector=embeddings[i],
            top_k=5,
            include_metadata=True
        )
        for match in results.matches:
            # Assumes texts[i] is stored under the ID "doc_{i}", so a match
            # on that ID is the text itself rather than a duplicate.
            if match.score >= threshold and match.id != f"doc_{i}":
                duplicates.append({
                    "original": texts[i],
                    "duplicate": match.metadata.get("chunk_text", ""),
                    "score": match.score
                })

    return duplicates


# =============================================
# USE CASE 4: Image Similarity Search
# =============================================
# pip install transformers torch pillow
# from PIL import Image
# from transformers import CLIPProcessor, CLIPModel

# def image_to_vector(image_path):
#     # Loading the model on every call is slow; load it once in real code.
#     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
#     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
#
#     image = Image.open(image_path)
#     inputs = processor(images=image, return_tensors="pt")
#     vector = model.get_image_features(**inputs)
#     # This checkpoint produces 512-dimensional vectors, so the index
#     # must be created with dimension=512.
#     return vector.detach().numpy().flatten().tolist()
#
# def search_similar_images(image_path, top_k=5):
#     query_vector = image_to_vector(image_path)
#     return index.query(vector=query_vector, top_k=top_k)
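
For small collections, the duplicate-detection idea from Use Case 3 can be prototyped entirely in memory before involving an index. The sketch below compares every pair of vectors and reports those above a similarity threshold; the vectors are invented toy values and `find_near_duplicates` is a hypothetical helper. Pairwise comparison grows quadratically with the number of items, which is exactly the cost a vector index is designed to avoid.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def find_near_duplicates(vectors_by_id, threshold=0.95):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors_by_id.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical toy embeddings: doc_0 and doc_1 are almost the same direction.
toy_vectors = {
    "doc_0": [1.0, 0.0, 0.1],
    "doc_1": [0.99, 0.01, 0.1],
    "doc_2": [0.0, 1.0, 0.0],
}
duplicate_pairs = find_near_duplicates(toy_vectors)
for id_a, id_b, score in duplicate_pairs:
    print(f"{id_a} ~ {id_b}: {score:.4f}")
```

Because each pair is visited once, a text is never compared with itself, which sidesteps the ID-matching assumption needed when querying an index.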

8. Pinecone vs Alternatives

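Database | Type                  | Strengths                     | Weaknesses
---------|-----------------------|-------------------------------|---------------------------------------------
Pinecone | Managed cloud         | Easy, fast, scalable          | Vendor lock-in, cost
Weaviate | Open-source / cloud   | Modular, GraphQL API          | More complex setup
Qdrant   | Open-source / cloud   | Advanced filtering, Rust      | Smaller ecosystem
Milvus   | Open-source           | Highly scalable, GPU support  | Needs substantial infrastructure
ChromaDB | Open-source, embedded | Simple, Python-first          | Less proven for large production workloads
pgvector | PostgreSQL extension  | Reuses existing PG infrastructure | Can trail dedicated engines at large scale
FAISS    | Library (not a DB)    | Very fast, from Meta          | No built-in database features; persistence is manual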

9. Best Practices & Optimization

💡 Vector Database Best Practices
  • Good chunking: split documents into 200-500 tokens, with 50-100 tokens of overlap
  • Model consistency: always use the SAME embedding model for indexing and querying
  • Metadata filters: pass a metadata filter with the query so the search only considers matching records
  • Namespaces: separate data by use case (articles, products, users)
  • Batch upserts: upsert in batches rather than one record at a time
  • Top-K selection: start with top_k=10-20, then re-rank for accuracy
  • Threshold: set a minimum similarity score (0.7-0.8 as a starting point, tuned per model) to filter out noise
  • Monitoring: track recall@k and latency to evaluate performance
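
The recall@k metric mentioned in the last bullet is simple to compute once you have a small set of queries with known relevant documents. The sketch below uses the common definition, the fraction of relevant documents that appear in the top k results; `recall_at_k` and the sample IDs are made up for this example.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents found among the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical evaluation case: 3 documents are known to be relevant.
retrieved = ["doc_a", "doc_x", "doc_b", "doc_y", "doc_c"]
relevant = ["doc_a", "doc_b", "doc_c"]

for k in (1, 3, 5):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

Averaging this value over a set of test queries gives a single number to watch when you change the embedding model, chunk size, or top_k.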

10. Comprehension Quiz

1. What does a vector database store?

2. Why can semantic search match "sepatu olahraga" with "running shoes"?

3. What is RAG?

4. Why is chunking needed before storing documents in a vector database?

5. Which metric is most commonly used for semantic search?

Summary

📝 Key Points
  • Vector database: stores embedding vectors and searches by similarity of meaning
  • Embeddings: turn text or images into high-dimensional numeric vectors
  • Cosine similarity: the most common metric for semantic search
  • Pinecone: a managed vector DB that is easy to set up and has low latency
  • RAG: retrieval from a vector DB combined with LLM generation, giving answers grounded in your data
  • Chunking: split documents into small pieces for precise retrieval
  • Model consistency: always use the same embedding model