Vector Databases: Pinecone — Embeddings, Similarity Search & RAG

📋 Table of Contents

Introduction to Vector Databases
Embeddings: Representing Data as Vectors
Setup Pinecone
CRUD Operations — Upsert, Query, Delete
Similarity Search & Filtering
RAG — Retrieval Augmented Generation
Use Cases: Semantic Search & Recommendation
Pinecone vs Alternatives
Best Practices & Optimization
Comprehension Quiz

1. Introduction to Vector Databases

A vector database is a specialized database that stores and manages data as vectors (high-dimensional arrays of numbers). Unlike a traditional database, which looks up records by exact keyword match, a vector database retrieves data by semantic similarity, that is, by closeness in meaning.

Suppose you search for "sepatu olahraga" (sports shoes) in a traditional database: it returns only documents that contain the words "sepatu olahraga". In a vector database, the same query can also surface documents about "running shoes", "sneakers untuk jogging" (sneakers for jogging), or "footwear aktivitas fisik" (footwear for physical activity), because all of these have similar meanings.

Diagram: Vector Database Concept

┌─────────────────────────────────────────────────────────────────┐
│                     VECTOR DATABASE CONCEPT                     │
│                                                                 │
│  Input Text → Embedding Model → Vector (numbers) → Store in DB  │
│                                                                 │
│  "sepatu olahraga"  →  [0.23, -0.45, 0.87, ..., 0.12]  (384D)│
│  "running shoes"    →  [0.25, -0.42, 0.85, ..., 0.15]  (384D)│
│  "sneakers jogging" →  [0.21, -0.48, 0.82, ..., 0.10]  (384D)│
│  "mobil sport"      →  [-0.67, 0.34, -0.21, ..., 0.78] (384D)│
│                                                                 │
│  Semantic Search: "sepatu lari" → vector query                  │
│                                                                 │
│  Results (by cosine similarity; scores are illustrative):       │
│  1. "running shoes"     → 0.98 (very similar) ✅                │
│  2. "sepatu olahraga"   → 0.95 (similar) ✅                    │
│  3. "sneakers jogging"  → 0.93 (similar) ✅                    │
│  4. "mobil sport"       → 0.12 (not similar) ❌                 │
│                                                                 │
│  Data with similar meaning has vectors that lie close together! │
└─────────────────────────────────────────────────────────────────┘
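The ranking idea in the diagram can be sketched in a few lines of plain Python. The three-dimensional vectors below are hypothetical stand-ins invented for this example (real embeddings have hundreds of dimensions and come from a model), and `cosine_similarity` and `rank` are illustrative helper names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-D "embeddings"; the values are made up for illustration
documents = {
    "running shoes":   [0.9, 0.1, 0.0],
    "sepatu olahraga": [0.8, 0.2, 0.1],
    "mobil sport":     [0.1, 0.9, 0.3],
}
query_vector = [0.85, 0.15, 0.05]  # stands in for the query "sepatu lari"

def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank(query_vector, documents)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

The two shoe-related vectors point in nearly the same direction as the query and land at the top, while "mobil sport" points elsewhere and ranks last.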

Why Do Vector Databases Matter?

Use Case	Example Application	Why a Vector DB?
Semantic Search	Smart search engine	Search by meaning, not keywords
RAG (Retrieval Augmented Generation)	ChatGPT + knowledge base	Lets an LLM answer questions from your own data
Recommendation	Similar products	Find items with similar embeddings
Image Search	Visual image search	Image → vector → similarity lookup
Anomaly Detection	Fraud detection	Outliers have vectors far from any cluster
Clustering	Automatic grouping	Group data by semantic similarity

What Is Pinecone?

Pinecone is a managed (fully hosted) vector database and one of the most widely used. Its main advantages: no servers to set up, automatic scaling, low query latency (often quoted as under 50 ms, though the actual figure depends on index size, region, and network), and straightforward integration with the AI/ML ecosystem, including OpenAI, LangChain, and LlamaIndex.

2. Embeddings: Representing Data as Vectors

An embedding is the result of converting data (text, images, audio) into a high-dimensional numeric vector. The vector represents the "meaning" or "features" of the data as a point in a mathematical space.

Popular Embedding Models

Model	Dimensions	Provider	Best For
text-embedding-3-small	1536	OpenAI	General text, cost-effective
text-embedding-3-large	3072	OpenAI	Text, high accuracy
all-MiniLM-L6-v2	384	Sentence Transformers	Free, fast, runs locally
multilingual-e5-large	1024	Microsoft	Multilingual (including Indonesian)
embed-english-v3.0	1024	Cohere	English text
gecko	768	Google	Multi-purpose

Python: Creating Embeddings

# =============================================
# EMBEDDINGS with OpenAI
# =============================================
# pip install openai

import openai

# Placeholder key; in practice, load it from an environment variable
client = openai.OpenAI(api_key="sk-...")

# Create an embedding for a single text
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Ini adalah panduan lengkap belajar Python untuk pemula"
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 elements: {vector[:5]}")
# Example of the output shape (actual values will differ):
# [0.0234, -0.0456, 0.0789, -0.0123, 0.0567]

# Create embeddings for several texts at once (batch)
texts = [
    "Cara belajar Python untuk pemula",
    "Tutorial JavaScript dasar",
    "Panduan database MySQL",
    "Resep nasi goreng spesial",
    "Jadwal pertandingan sepak bola"
]

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)

embeddings = [item.embedding for item in response.data]
print(f"Number of embeddings: {len(embeddings)}")  # 5


# =============================================
# EMBEDDINGS with Sentence Transformers (FREE, LOCAL)
# =============================================
# pip install sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed a single text
vector = model.encode("Belajar Python dari nol")
print(f"Dimensions: {len(vector)}")  # 384

# Embed a batch
texts = [
    "Cara belajar Python untuk pemula",
    "Tutorial JavaScript dasar",
    "Panduan database MySQL",
    "Resep nasi goreng spesial"
]
vectors = model.encode(texts)
print(f"Shape: {vectors.shape}")  # (4, 384)


# =============================================
# MULTILINGUAL EMBEDDINGS (for Indonesian)
# =============================================
model_multi = SentenceTransformer('intfloat/multilingual-e5-large')

# Embed Indonesian text; E5 models expect a "query: " or "passage: " prefix
vector_id = model_multi.encode("query: Apa itu machine learning?")
print(f"Dimensions: {len(vector_id)}")  # 1024

💡 Embedding Tips

Model consistency: use the SAME model for indexing and querying
Indonesian text: use a multilingual model (multilingual-e5, BGE-M3)
Chunking: split long documents into chunks of 200-500 tokens before embedding
Query prefix: some models need a "query:" or "passage:" prefix (the E5 family, for example)
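The chunking and prefix tips can be sketched as two small helpers. This is a minimal illustration under stated assumptions: `chunk_words` and `with_e5_prefix` are hypothetical names, sizes are counted in whitespace-separated words (only a rough stand-in for tokens), and the overlap between neighbouring chunks is an added choice so that a sentence cut at a boundary still appears whole in one chunk.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into word-based chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def with_e5_prefix(texts, kind):
    """Add the "query: " or "passage: " prefix used by E5-style models."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
print(with_e5_prefix(chunks, "passage")[0])
```

With `chunk_size=4` and `overlap=1`, each chunk starts on the last word of the previous one. Stored chunks would take the "passage" prefix and search queries the "query" prefix.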

3. Setup Pinecone

Python — Setup Pinecone

# =============================================
# STEP 1: Install
# =============================================
# pip install pinecone

# =============================================
# STEP 2: Initialize Pinecone
# =============================================
from pinecone import Pinecone, ServerlessSpec

# Initialize the client
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# =============================================
# STEP 3: Create an index (the database for your vectors)
# =============================================

# List the indexes that already exist
existing_indexes = pc.list_indexes().names()
print(f"Existing indexes: {existing_indexes}")

# Create the index only if it does not exist yet
if 'tutorial-index' not in existing_indexes:
    pc.create_index(
        name='tutorial-index',
        dimension=1536,       # Must match the embedding model's dimension
        metric='cosine',      # cosine, euclidean, or dotproduct
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )
    print("Index 'tutorial-index' created!")

# Connect to the index
index = pc.Index('tutorial-index')

# Check index statistics
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")
print(f"Dimension: {stats.dimension}")
print(f"Namespaces: {list(stats.namespaces.keys())}")


# =============================================
# DISTANCE METRIC: pick the one that fits your data
# =============================================
# cosine     → Best fit for semantic search (the most common choice)
#              Measures the angle between two vectors
#              (range -1 to 1; 1 = same direction)
#
# euclidean  → Measures the straight-line distance between two points
#              Suits numeric/spatial data
#
# dotproduct → Like cosine but without length normalization
#              Equivalent to cosine when vectors are already unit-length

Diagram: Similarity Metrics

┌─────────────────────────────────────────────────────────────────┐
│                       SIMILARITY METRICS                        │
│                                                                 │
│  COSINE SIMILARITY                                              │
│  ─────────────────                                              │
│  Measures the ANGLE between two vectors                         │
│                                                                 │
│     v1 ●────────●  v2                                          │
│         \θ      /     cos(θ) = 1 → identical                    │
│          \    /        cos(θ) = 0 → orthogonal (unrelated)      │
│           \  /         cos(θ) = -1 → opposite                   │
│            \/                                                    │
│                                                                 │
│  Range: [-1, 1]; higher = more similar                          │
│  Best for: text embeddings, semantic search                     │
│                                                                 │
│  EUCLIDEAN DISTANCE                                             │
│  ────────────────────                                           │
│  Measures the STRAIGHT-LINE DISTANCE between two points         │
│                                                                 │
│     (1,4) ●              d = √(Σ(a-b)²)                        │
│              \             d = 0 → identical                    │
│               \            d small → similar                    │
│                ● (4,1)     d large → different                  │
│                                                                 │
│  Range: [0, ∞); smaller = more similar                          │
│  Best for: spatial data, numerical features                     │
└─────────────────────────────────────────────────────────────────┘
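As a worked example, the three metrics can be computed by hand for the two points in the diagram, (1,4) and (4,1). The helper names below are illustrative and use only the standard library. The last lines check the point made about dotproduct: once both vectors are scaled to unit length, the dot product and the cosine similarity are the same number.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

p, q = [1.0, 4.0], [4.0, 1.0]
print(f"cosine    = {cosine_similarity(p, q):.4f}")   # 8/17
print(f"euclidean = {euclidean_distance(p, q):.4f}")  # sqrt(18)
print(f"dot       = {dot(p, q):.4f}")                 # 1*4 + 4*1

# After normalization, dot product and cosine agree
p_unit, q_unit = normalize(p), normalize(q)
print(f"dot (unit vectors) = {dot(p_unit, q_unit):.4f}")
```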

4. CRUD Operations — Upsert, Query, Delete

Python — Pinecone CRUD Operations

# =============================================
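# UPSERT: Store vectors (create/update)
# =============================================

# Format: a list of dicts with "id", "values", and optional "metadata"
vectors_to_upsert = [
    {
        "id": "doc_001",
        # Truncated for display; a real vector needs all 1536 values
        # to match the index dimension, or the upsert is rejected
        "values": [0.023, -0.045, 0.078, 0.012, 0.056],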
        "metadata": {
            "title": "Tutorial Python Pemula",
            "category": "programming",
            "language": "id",
            "source": "beebanelabs.com",
            "year": 2026,
            "chunk_text": "Python adalah bahasa pemrograman serbaguna..."
        }
    },
    {
        "id": "doc_002",
        "values": [0.025, -0.042, 0.085, 0.015, 0.050],
        "metadata": {
            "title": "Belajar JavaScript untuk Pemula",
            "category": "programming",
            "language": "id",
            "source": "beebanelabs.com",
            "year": 2026,
            "chunk_text": "JavaScript adalah bahasa pemrograman web..."
        }
    },
    {
        "id": "doc_003",
        "values": [-0.067, 0.034, -0.021, 0.078, 0.091],
        "metadata": {
            "title": "Resep Nasi Goreng Spesial",
            "category": "cooking",
            "language": "id",
            "source": "resepmama.com",
            "year": 2025,
            "chunk_text": "Nasi goreng adalah makanan khas Indonesia..."
        }
    }
]

# Upsert into the index
index.upsert(vectors=vectors_to_upsert)
print(f"Upserted {len(vectors_to_upsert)} vectors")

# Check statistics after the upsert (counts can lag briefly behind writes)
stats = index.describe_index_stats()
print(f"Total vectors: {stats.total_vector_count}")


# =============================================
# UPSERT with a NAMESPACE (data partition)
# =============================================

# Namespaces separate data inside the same index.
# The [...] below is a placeholder for a full-length embedding.
index.upsert(
    vectors=[
        {"id": "art_001", "values": [...], "metadata": {"title": "..."}}
    ],
    namespace="articles"  # Namespace for articles
)

index.upsert(
    vectors=[
        {"id": "prod_001", "values": [...], "metadata": {"title": "..."}}
    ],
    namespace="products"  # Namespace for products
)


# =============================================
# QUERY: Find similar vectors
# =============================================

# Create an embedding for the query
query_text = "cara belajar coding untuk pemula"
# get_embedding() is your own embedding helper (one is defined in section 6);
# it must use the same model that produced the stored vectors
query_vector = get_embedding(query_text)

# Find the 3 most similar vectors
results = index.query(
    vector=query_vector,
    top_k=3,
    include_metadata=True
)

# Print the results
for match in results.matches:
    print(f"ID: {match.id}")
    print(f"Score: {match.score:.4f}")  # Similarity score
    print(f"Title: {match.metadata.get('title')}")
    print(f"Category: {match.metadata.get('category')}")
    print(f"Text: {match.metadata.get('chunk_text', '')[:100]}...")
    print("---")


# =============================================
# QUERY with a FILTER
# =============================================

# Filter on metadata (multiple fields are combined with AND)
filtered_results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True,
    filter={
        "category": {"$eq": "programming"},  # programming only
        "year": {"$gte": 2025}               # year >= 2025
    }
)

# Filter operators:
# $eq    : equal to
# $ne    : not equal to
# $gt    : greater than
# $gte   : greater than or equal to
# $lt    : less than
# $lte   : less than or equal to
# $in    : in the given list
# $nin   : not in the given list


# =============================================
# FETCH: Retrieve vectors by ID
# =============================================
fetched = index.fetch(ids=["doc_001", "doc_002"])
for vid, vector_data in fetched.vectors.items():
    print(f"ID: {vid}")
    print(f"Metadata: {vector_data.metadata}")


# =============================================
# UPDATE: Change a vector's metadata
# =============================================
index.update(
    id="doc_001",
    set_metadata={"views": 1500, "updated": True}
)


# =============================================
# DELETE: Remove vectors
# =============================================

# Delete by ID
index.delete(ids=["doc_003"])

# Delete everything in a namespace
index.delete(delete_all=True, namespace="articles")

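# Delete by metadata filter (availability has varied by index type,
# so confirm it is supported for your index before relying on it)
index.delete(filter={"category": {"$eq": "deprecated"}})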
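
To make the semantics of the filter operators listed above concrete, here is a local sketch that evaluates a Pinecone-style filter dict against a plain metadata dict. It illustrates the operators only and is not Pinecone's implementation: `matches_filter` is a hypothetical helper, and treating a missing field as a non-match is a simplifying assumption.

```python
import operator

# Comparison operators from the list above, mapped to Python equivalents
_OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches_filter(metadata, filter_dict):
    """Return True if `metadata` satisfies every condition in `filter_dict`."""
    for field, condition in filter_dict.items():
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # shorthand form: {"field": value}
        if field not in metadata:
            return False  # simplification: a missing field never matches
        for op_name, expected in condition.items():
            if not _OPERATORS[op_name](metadata[field], expected):
                return False
    return True

docs = {
    "doc_001": {"category": "programming", "year": 2026},
    "doc_003": {"category": "cooking", "year": 2025},
}
query_filter = {"category": {"$eq": "programming"}, "year": {"$gte": 2025}}
kept = [doc_id for doc_id, meta in docs.items() if matches_filter(meta, query_filter)]
print(kept)
```

Conditions on different fields are combined with AND, so "doc_003" is dropped even though its year passes the `$gte` check.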

5. Similarity Search & Filtering

Python — Advanced Similarity Search

# =============================================
# SEMANTIC SEARCH: Meaning-based search
# =============================================

def semantic_search(query, top_k=5, filter_dict=None):
    """Semantic search with an optional metadata filter"""
    query_vector = get_embedding(query)

    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict
    )

    return [
        {
            "id": match.id,
            "score": match.score,
            "title": match.metadata.get("title", ""),
            "text": match.metadata.get("chunk_text", ""),
            "category": match.metadata.get("category", ""),
            "metadata": match.metadata
        }
        for match in results.matches
    ]

# Example usage:
# "machine learning" → finds articles about ML, AI, deep learning
results = semantic_search("machine learning untuk pemula")
for r in results:
    print(f"[{r['score']:.3f}] {r['title']}")

# With a filter:
results = semantic_search(
    "tutorial coding",
    filter_dict={"category": "programming", "language": "id"}
)


# =============================================
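# FILTERED SEARCH: Vector + Metadata filter
# =============================================
# Note: "hybrid search" usually means combining dense and sparse (keyword)
# vectors; this example is a vector search narrowed by a metadata filter.

# Find programming articles, but only recent ones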
results = index.query(
    vector=get_embedding("framework web terbaik"),
    top_k=10,
    include_metadata=True,
    filter={
        "category": {"$in": ["programming", "web-dev"]},
        "year": {"$gte": 2025},
        "language": {"$eq": "id"}
    }
)


# =============================================
# MULTI-QUERY: Combine several queries
# =============================================

import numpy as np

def multi_query_search(queries, weights=None, top_k=5):
    """Blend several queries into a single query vector"""
    if weights is None:
        weights = [1.0 / len(queries)] * len(queries)

    # Create an embedding for each query
    vectors = [get_embedding(q) for q in queries]

    # Weighted average of the query vectors
    combined = np.zeros(len(vectors[0]), dtype=float)
    for vec, weight in zip(vectors, weights):
        combined += np.array(vec) * weight

    # Normalize to unit length
    combined = combined / np.linalg.norm(combined)

    return index.query(
        vector=combined.tolist(),
        top_k=top_k,
        include_metadata=True
    )

# Example: results related to both "Python" and "data science".
# Averaging pulls the search toward both topics; it is not a strict AND.
results = multi_query_search(
    ["Python programming", "data science tutorial"],
    weights=[0.6, 0.4]  # Python gets the higher weight
)


# =============================================
# SEARCH + RE-RANKING
# =============================================

def search_with_rerank(query, top_k=20, final_k=5):
    """Fetch many candidates, then re-rank them"""
    # Step 1: Get candidates from vector search
    candidates = index.query(
        vector=get_embedding(query),
        top_k=top_k,
        include_metadata=True
    )

    # Step 2: Re-rank by relevance (for example with an LLM)
    texts = [m.metadata.get("chunk_text", "") for m in candidates.matches]

    # Simple example: re-rank by keyword overlap with the query
    query_words = set(query.lower().split())
    scored = []
    for match, text in zip(candidates.matches, texts):
        text_words = set(text.lower().split())
        overlap = len(query_words & text_words)
        # max(..., 1) avoids division by zero on an empty query
        overlap_ratio = overlap / max(len(query_words), 1)
        final_score = match.score * 0.7 + overlap_ratio * 0.3
        scored.append((match, final_score))

    # Highest blended score first
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:final_k]
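
The score blend used in the re-ranking step can be tried offline, without an index. The sketch below reuses the 0.7 / 0.3 weighting from the listing above; the candidate tuples are hypothetical stand-ins for vector-search results, and `keyword_overlap` and `rerank` are illustrative names.

```python
def keyword_overlap(query, text):
    """Fraction of distinct query words that also appear in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.lower().split())) / len(query_words)

def rerank(query, candidates, vector_weight=0.7):
    """candidates: list of (doc_id, vector_score, text) tuples."""
    blended = [
        (doc_id,
         vector_weight * score + (1 - vector_weight) * keyword_overlap(query, text))
        for doc_id, score, text in candidates
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates, as if returned by a vector search
candidates = [
    ("a", 0.82, "Panduan umum pemrograman"),
    ("b", 0.80, "Tutorial Python untuk pemula"),
]
reranked = rerank("tutorial python", candidates)
print(reranked)
```

Candidate "a" has the slightly higher vector score, but "b" contains both query words, so the blended score moves it to the top.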

6. RAG — Retrieval Augmented Generation

RAG (Retrieval Augmented Generation) is a technique that combines retrieval (fetching relevant data from a vector database) with generation (an LLM producing the answer). It lets an LLM answer questions from your own data without any fine-tuning.

Diagram: RAG Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                    RAG PIPELINE                                  │
│                                                                 │
│  User Query: "How much is the BeebaneLabs Enterprise plan?"     │
│                          │                                      │
│                          ▼                                      │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 1: RETRIEVAL                  │                        │
│  │  Query → Embedding → Vector Search  │                        │
│  │  di Pinecone                        │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  Top-K Results (relevant documents):                            │
│  • "Enterprise plan: Rp 5M/month..."  (score: 0.95)             │
│  • "Enterprise features include..."   (score: 0.89)             │
│  • "Plan comparison..."               (score: 0.82)             │
│                     │                                           │
│                     ▼                                           │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 2: AUGMENTATION               │                        │
│  │  Combine: System Prompt +           │                        │
│  │  Context (retrieved docs) +         │                        │
│  │  User Query                         │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  ┌─────────────────────────────────────┐                        │
│  │  STEP 3: GENERATION                 │                        │
│  │  LLM (GPT-4, Claude, etc.)          │                        │
│  │  generates an answer based on       │                        │
│  │  the supplied context               │                        │
│  └──────────────────┬──────────────────┘                        │
│                     │                                           │
│                     ▼                                           │
│  "The BeebaneLabs Enterprise plan costs                         │
│   Rp 5,000,000 per month. It includes                           │
│   unlimited users, priority support,                            │
│   and a custom domain."                                         │
│                                                                 │
│  ✅ Accurate answer from your own data!                          │
│  ✅ No fine-tuning required!                                     │
│  ✅ Can cite its sources!                                        │
└─────────────────────────────────────────────────────────────────┘
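Steps 1 and 2 of the pipeline can be sketched offline with an in-memory list standing in for the vector database. Everything here is a toy under stated assumptions: the two-dimensional vectors and chunk texts are invented, `retrieve` and `build_prompt` are hypothetical helpers, and the generation step is left out because it needs an LLM call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vector, store, top_k=2, min_score=0.7):
    """Step 1: return the top_k (score, text) pairs scoring at least min_score."""
    scored = sorted(
        ((cosine(query_vector, vec), text) for vec, text in store),
        reverse=True,
    )
    return [(score, text) for score, text in scored[:top_k] if score >= min_score]

def build_prompt(question, retrieved):
    """Step 2: join the retrieved chunks and the question into one prompt."""
    context = "\n---\n".join(text for _, text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical 2-D vectors paired with document chunks
store = [
    ([0.9, 0.1], "Enterprise plan: Rp 5M/month"),
    ([0.8, 0.3], "Enterprise features include priority support"),
    ([0.1, 0.9], "Office opening hours"),
]
retrieved = retrieve([1.0, 0.2], store)
prompt = build_prompt("How much is the Enterprise plan?", retrieved)
print(prompt)
```

The unrelated "Office opening hours" chunk never reaches the prompt, which is the point of retrieval: the LLM only sees context that is close to the question.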

Python — Full RAG Pipeline

# =============================================
# FULL RAG PIPELINE
# =============================================
# pip install pinecone openai

import openai
from pinecone import Pinecone

# Setup
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
index = pc.Index("knowledge-base")
openai_client = openai.OpenAI(api_key="YOUR_OPENAI_KEY")


def get_embedding(text, model="text-embedding-3-small"):
    """Create an embedding for a text"""
    response = openai_client.embeddings.create(
        model=model, input=text
    )
    return response.data[0].embedding


def retrieve_context(query, top_k=5, namespace=None):
    """Fetch relevant context from Pinecone"""
    query_vector = get_embedding(query)

    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        namespace=namespace
    )

    contexts = []
    sources = []
    for match in results.matches:
        if match.score > 0.7:  # Similarity threshold; tune it for your embedding model
            contexts.append(match.metadata.get("chunk_text", ""))
            sources.append({
                "id": match.id,
                "title": match.metadata.get("title", ""),
                "score": match.score
            })

    return contexts, sources


def generate_answer(query, contexts, sources):
    """Generate an answer with the LLM"""
    # Label each chunk with its source title so the model can cite it
    context_text = "\n\n---\n\n".join(
        f"[Source: {source['title']}]\n{context}"
        for context, source in zip(contexts, sources)
    )

    system_prompt = """You are an AI assistant that answers questions
based on the context provided.

Rules:
1. Answer ONLY from the context provided
2. If the information is not in the context, say "I could not find
   that information in our database"
3. Give clear, informative answers
4. Cite the source where relevant"""

    user_prompt = f"""Context from the database:
---
{context_text}
---

User question: {query}

Answer the question based on the context above."""

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.3,  # Low = more deterministic, stays closer to the context
        max_tokens=1000
    )

    return response.choices[0].message.content


def rag_pipeline(query, namespace=None):
    """Full RAG pipeline"""
    # Step 1: Retrieve
    contexts, sources = retrieve_context(query, namespace=namespace)

    if not contexts:
        return "Sorry, I could not find any relevant information.", []

    # Steps 2 + 3: Augment + Generate
    answer = generate_answer(query, contexts, sources)

    return answer, sources


# =============================================
# USAGE EXAMPLE
# =============================================

# Ask a question about your own data
query = "Bagaimana cara menginstall Python di Windows?"
answer, sources = rag_pipeline(query)

print("Answer:", answer)
print("\nSources:")
for s in sources:
    print(f"  - {s['title']} (score: {s['score']:.3f})")


# =============================================
# INGEST DATA: Load documents into Pinecone
# =============================================

def ingest_document(doc_id, text, metadata, chunk_size=500):
    """Split a document into chunks and load them into Pinecone.

    chunk_size counts whitespace-separated words (a rough stand-in
    for tokens); the chunks do not overlap.
    """
    # Step 1: Chunking (split the long text)
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    # Step 2: Embed each chunk and build the records to upsert
    vectors = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        chunk_id = f"{doc_id}_chunk_{i}"

        vectors.append({
            "id": chunk_id,
            "values": embedding,
            "metadata": {
                **metadata,
                "chunk_text": chunk,
                "chunk_index": i,
                "parent_doc_id": doc_id
            }
        })

    # Batch upsert: Pinecone caps a single request at 1000 records
    # (request size is limited too), so 100 is a conservative batch size
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

    print(f"Ingested {len(chunks)} chunks for doc {doc_id}")


# Example ingest:
ingest_document(
    doc_id="tutorial_python",
    text="Python adalah bahasa pemrograman serbaguna yang diciptakan oleh Guido van Rossum...",
    metadata={
        "title": "Tutorial Python Lengkap",
        "category": "programming",
        "source": "beebanelabs.com"
    }
)
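
The batching loop inside `ingest_document` can be pulled out into a small generator, which makes it easy to check how a given number of records is split. `batched` is a hypothetical helper name, and the 250 placeholder records stand in for embedded chunks.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 placeholder records stand in for embedded chunks
records = [{"id": f"chunk_{i}"} for i in range(250)]
batch_sizes = [len(batch) for batch in batched(records, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Each yielded slice would be passed to one `index.upsert(vectors=batch)` call, so 250 records cost three requests instead of 250.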

7. Use Cases: Semantic Search & Recommendation

Python — Use Case Implementations

# =============================================
# USE CASE 1: Semantic Search Engine
# =============================================
import os

from pinecone import Pinecone


class SemanticSearchEngine:
    def __init__(self, index_name, embedding_model="text-embedding-3-small"):
        pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
        self.index = pc.Index(index_name)
        self.model = embedding_model

    def search(self, query, filters=None, top_k=10):
        vector = get_embedding(query, self.model)
        return self.index.query(
            vector=vector,
            top_k=top_k,
            include_metadata=True,
            filter=filters
        )

    def search_with_threshold(self, query, threshold=0.75, **kwargs):
        results = self.search(query, **kwargs)
        return [m for m in results.matches if m.score >= threshold]

# Usage:
engine = SemanticSearchEngine("knowledge-base")
results = engine.search("bagaimana cara deploy aplikasi ke cloud")


# =============================================
# USE CASE 2: Product Recommendation
# =============================================

def recommend_products(product_id, top_k=5):
    """Recommend similar products"""
    # Fetch the vector of the product being viewed
    product = index.fetch(ids=[product_id])
    product_vector = product.vectors[product_id].values

    # Find products with similar embeddings
    similar = index.query(
        vector=product_vector,
        top_k=top_k + 1,  # +1 because the product itself also matches
        include_metadata=True,
        filter={"in_stock": {"$eq": True}}
    )

    # Exclude the product being viewed
    return [m for m in similar.matches if m.id != product_id][:top_k]


# =============================================
# USE CASE 3: Duplicate Detection
# =============================================

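def find_duplicates(texts, ids=None, threshold=0.95):
    """Find near-identical texts already stored in the index"""
    # ids[i] is the stored vector ID for texts[i]; defaults to doc_0, doc_1, ...
    if ids is None:
        ids = [f"doc_{i}" for i in range(len(texts))]
    embeddings = [get_embedding(t) for t in texts]
    duplicates = []

    for i in range(len(texts)):
        results = index.query(
            vector=embeddings[i],
            top_k=5,
            include_metadata=True
        )
        for match in results.matches:
            # Skip the text's own record so it is not reported as its own duplicate
            if match.score >= threshold and match.id != ids[i]:
                duplicates.append({
                    "original": texts[i],
                    "duplicate": match.metadata.get("chunk_text", ""),
                    "score": match.score
                })

    return duplicates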


# =============================================
# USE CASE 4: Image Similarity Search
# =============================================
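# pip install transformers torch pillow
# from PIL import Image
# from transformers import CLIPProcessor, CLIPModel
#
# Note: clip-vit-base-patch32 produces 512-dimensional vectors, so the
# target index must be created with dimension=512.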

# def image_to_vector(image_path):
#     model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
#     processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
#
#     image = Image.open(image_path)
#     inputs = processor(images=image, return_tensors="pt")
#     vector = model.get_image_features(**inputs)
#     return vector.detach().numpy().flatten().tolist()
#
# def search_similar_images(image_path, top_k=5):
#     query_vector = image_to_vector(image_path)
#     return index.query(vector=query_vector, top_k=top_k)
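
The thresholding idea behind duplicate detection can also be tried locally on a handful of vectors, without querying an index. This is a brute-force sketch for small collections: the three-dimensional vectors are invented for illustration, and `find_near_duplicates` is a hypothetical helper that compares every pair, which does not scale the way an index query does.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_near_duplicates(vectors, threshold=0.95):
    """vectors: dict of id -> vector. Returns (id_a, id_b, score) for pairs at or above threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs

# Hypothetical embeddings: doc_0 and doc_1 are nearly identical
vectors = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.99, 0.05, 0.0],
    "doc_2": [0.0, 1.0, 0.0],
}
pairs = find_near_duplicates(vectors)
for id_a, id_b, score in pairs:
    print(f"{id_a} ~ {id_b}: {score:.4f}")
```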

8. Pinecone vs Alternatives

Database	Type	Strengths	Weaknesses
Pinecone	Managed cloud	Easy, fast, scalable	Vendor lock-in, cost
Weaviate	Open-source / cloud	Modular, GraphQL API	More complex setup
Qdrant	Open-source / cloud	Advanced filtering, written in Rust	Smaller ecosystem
Milvus	Open-source	Highly scalable, GPU support	Needs substantial infrastructure
ChromaDB	Open-source, embedded	Simple, Python-first	Aimed mainly at prototyping and smaller workloads
pgvector	PostgreSQL extension	Reuses existing PG infrastructure	Can trail dedicated vector DBs at large scale
FAISS	Library (not a DB)	Very fast, from Meta	No database layer; persistence is limited to saving and loading index files

9. Best Practices & Optimization

💡 Vector Database Best Practices

Good chunking: split documents into 200-500 token chunks with a 50-100 token overlap
Model consistency: always use the SAME embedding model for indexing and querying
Metadata filters: pass a filter with the query so the search only considers matching vectors
Namespaces: separate data by use case (articles, products, users)
Batch upserts: send vectors in batches rather than one at a time
Top-K selection: start with top_k=10-20, then re-rank for accuracy
Threshold: set a minimum similarity score to filter out noise (0.7-0.8 is a starting point; the right value depends on the embedding model)
Monitoring: track recall@k and latency to evaluate performance
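For the monitoring item, recall@k is simple to compute once you have a set of queries with known relevant documents. The sketch below shows the metric for a single query; the document IDs are hypothetical, and `recall_at_k` is an illustrative helper rather than a library function.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the first k retrieved ids."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical search result (best match first) and ground-truth labels
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4", "doc_5"}

recall_3 = recall_at_k(retrieved, relevant, k=3)
recall_5 = recall_at_k(retrieved, relevant, k=5)
print(f"recall@3 = {recall_3:.3f}")
print(f"recall@5 = {recall_5:.3f}")
```

Here one of the three relevant documents appears in the top 3 and two appear in the top 5. Averaging this value over a fixed query set gives a number you can track when you change the chunk size, the embedding model, or `top_k`.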

10. Comprehension Quiz

Summary

📝 Key Points

Vector database: stores embedding vectors and searches by similarity of meaning
Embeddings: turn text or images into high-dimensional numeric vectors
Cosine similarity: the most common metric for semantic search
Pinecone: a managed vector DB that is easy to set up and offers low latency
RAG: retrieval from a vector DB combined with LLM generation, for answers grounded in your own data
Chunking: split documents into small pieces for precise retrieval
Model consistency: always use the same embedding model