Pinecone Vector Database: Index, Upsert, Query

Build a working vector search pipeline with Pinecone in Python. Indexes, upserts, metadata filters, hybrid search, and patterns for production RAG.

·4 min read · By Yash Kesharwani
Intermediate 9 min read

What you'll learn

  • Create a serverless Pinecone index from Python
  • Upsert vectors with metadata in batches
  • Run nearest-neighbor queries with filters
  • Use namespaces to isolate tenants
  • Tune top_k, dimensions, and metric for quality

Prerequisites

  • Read [RAG Embeddings Explained](/blog/rag-embeddings-explained)
  • Skim [RAG Vector Databases Overview](/blog/rag-vector-databases-overview)
  • Python 3.10 or later, a Pinecone API key, and an OpenAI API key

Pinecone is a managed vector database. You hand it embeddings plus metadata, ask for nearest neighbors, and let someone else run the index. This guide walks the full loop end to end.

Install and authenticate

pip install pinecone openai
export PINECONE_API_KEY=pcsk-...
export OPENAI_API_KEY=sk-...

from pinecone import Pinecone, ServerlessSpec

# With no arguments, the client reads PINECONE_API_KEY from the environment.
pc = Pinecone()

Create an index

An index is a logical store for vectors of one fixed dimension. Pick the dimension that matches your embedding model; text-embedding-3-small returns 1536-dimensional vectors by default.

INDEX = "docs"
if not pc.has_index(INDEX):
    pc.create_index(
        name=INDEX,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
index = pc.Index(INDEX)

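Cosine is the usual choice for text embeddings. Dot product gives the same ranking when vectors are unit length, as OpenAI embeddings are; on unnormalized vectors it also rewards magnitude, so choose it only when you want that behavior.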
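
To see how the two metrics differ, here is a small pure-Python sketch with toy 2-D vectors. The helper names (dot, norm, cosine, normalize) are illustrative only and are not part of the Pinecone client.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length
c = [4.0, 3.0]  # same length as a, different direction

# Cosine ignores length, so a and b look identical to it.
cos_ab = cosine(a, b)
cos_ac = cosine(a, c)

# The raw dot product grows with magnitude, so b scores far above c.
dot_ab = dot(a, b)
dot_ac = dot(a, c)

# Once vectors are unit length, dot product and cosine agree.
unit_dot_ac = dot(normalize(a), normalize(c))
```

On unit-length embeddings the choice of metric does not change the ranking, which is why cosine is a safe default.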

Upsert vectors

Each record needs an id, a vector, and optional metadata. Batch upserts. One vector at a time is fine for demos and miserable in production.

from openai import OpenAI
oai = OpenAI()

docs = [
    {"id": "d1", "text": "Pinecone is a managed vector database."},
    {"id": "d2", "text": "FAISS runs vector search in process."},
    {"id": "d3", "text": "pgvector adds vector types to Postgres."},
]

emb = oai.embeddings.create(
    model="text-embedding-3-small",
    input=[d["text"] for d in docs],
)

vectors = [
    {"id": d["id"], "values": e.embedding, "metadata": {"text": d["text"]}}
    for d, e in zip(docs, emb.data)
]

index.upsert(vectors=vectors, namespace="default")

Keep metadata small. Store full documents elsewhere and reference them by id.
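
For more than a handful of records, batching can be sketched as below. The helper names (batched, upsert_in_batches) and the batch size of 100 are assumptions for illustration; check Pinecone's current request size limits before picking a number. RecordingIndex is a stand-in so the sketch runs without a network call; in real code you would pass the index object created earlier.

```python
from typing import Iterator

def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of `items` with at most `size` records each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def upsert_in_batches(index, vectors: list[dict], namespace: str,
                      batch_size: int = 100) -> int:
    """Upsert `vectors` one batch per request; return the number of requests."""
    calls = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        calls += 1
    return calls

class RecordingIndex:
    """Stand-in for a Pinecone index that only records batch sizes."""
    def __init__(self):
        self.batch_sizes = []

    def upsert(self, vectors, namespace):
        self.batch_sizes.append(len(vectors))

# 250 dummy records go out as three requests instead of 250.
demo_vectors = [{"id": f"d{i}", "values": [0.0, 0.0, 0.0]} for i in range(250)]
demo_index = RecordingIndex()
requests_sent = upsert_in_batches(demo_index, demo_vectors, namespace="default")
```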

Querying

Embed the query with the same model, then ask the index for top-k.

q = "Which database can I run inside Postgres?"
qvec = oai.embeddings.create(model="text-embedding-3-small", input=q).data[0].embedding

res = index.query(
    vector=qvec,
    top_k=3,
    namespace="default",
    include_metadata=True,
)
for match in res.matches:
    print(round(match.score, 3), match.metadata["text"])

Higher score means closer. Cosine similarity ranges from -1 to 1, but text embeddings rarely point in opposite directions, so in practice scores land roughly between 0 and 1.

Metadata filters

You can filter by metadata at query time. This is fast because Pinecone applies the filter during the search instead of trimming the results afterwards.

index.upsert(vectors=[
    {"id": "p1", "values": qvec, "metadata": {"team": "platform", "year": 2025}},
    {"id": "p2", "values": qvec, "metadata": {"team": "growth", "year": 2024}},
], namespace="default")

res = index.query(
    vector=qvec,
    top_k=5,
    filter={"team": {"$eq": "platform"}, "year": {"$gte": 2025}},
    include_metadata=True,
    namespace="default",
)

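Supported operators include $eq, $ne, $gt, $gte, $lt, $lte, $in, and $nin. Multiple fields in one filter, as in the query above, are combined with AND; $and and $or let you spell the logic out.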
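
If the filter syntax is new to you, a toy local evaluator can make the semantics concrete. This is not how Pinecone implements filtering; the matches function is hypothetical, handles only the flat field-to-operator form shown above, and simply treats a missing field as a non-match.

```python
from typing import Any

# Comparison operators, keyed by the names used in the filter syntax.
OPS = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}

def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every field condition in `flt`.

    Toy model: top-level fields are ANDed, and a missing field never matches.
    """
    for field, conditions in flt.items():
        if field not in metadata:
            return False
        for op, target in conditions.items():
            if not OPS[op](metadata[field], target):
                return False
    return True

records = {
    "p1": {"team": "platform", "year": 2025},
    "p2": {"team": "growth", "year": 2024},
}

# Same filter as the query above: platform team AND year >= 2025.
strict = {"team": {"$eq": "platform"}, "year": {"$gte": 2025}}
strict_hits = [rid for rid, md in records.items() if matches(md, strict)]

# $in accepts any value from a list; $lt narrows to earlier years.
older = {"team": {"$in": ["platform", "growth"]}, "year": {"$lt": 2025}}
older_hits = [rid for rid, md in records.items() if matches(md, older)]
```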

Namespaces for multi-tenant apps

A namespace is a partition inside the index. Queries are scoped to a single namespace, which is the cleanest way to isolate per-tenant data.

index.upsert(vectors=vectors, namespace="tenant_acme")
res = index.query(vector=qvec, top_k=3, namespace="tenant_acme")

One index, many namespaces. Cheaper than spinning up an index per customer.
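
Isolation only holds if every call passes the right namespace, so it can help to pin the namespace in one place. The wrapper below is a minimal sketch: namespace_for, TenantIndex, and the tenant_ naming convention are hypothetical, and EchoIndex is a stand-in that echoes its arguments so the example runs offline.

```python
import re

_TENANT_ID = re.compile(r"[a-z0-9_-]+")

def namespace_for(tenant_id: str) -> str:
    """Map a tenant id to a namespace name, rejecting ids that need rewriting.

    Rejecting (rather than sanitizing) avoids two ids collapsing into one namespace.
    """
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"unsupported tenant id: {tenant_id!r}")
    return f"tenant_{tenant_id}"

class TenantIndex:
    """Wrap an index-like object so every call is pinned to one namespace."""

    def __init__(self, index, tenant_id: str):
        self._index = index
        self.namespace = namespace_for(tenant_id)

    def upsert(self, vectors):
        return self._index.upsert(vectors=vectors, namespace=self.namespace)

    def query(self, vector, top_k: int = 3, **kwargs):
        # Refuse a caller-supplied namespace so one tenant cannot read another's data.
        if "namespace" in kwargs:
            raise ValueError("namespace is fixed by TenantIndex")
        return self._index.query(
            vector=vector, top_k=top_k, namespace=self.namespace, **kwargs
        )

class EchoIndex:
    """Stand-in index that returns the keyword arguments it received."""
    def upsert(self, **kwargs):
        return kwargs

    def query(self, **kwargs):
        return kwargs

acme = TenantIndex(EchoIndex(), "acme")
sent = acme.query([0.1, 0.2], top_k=2, include_metadata=True)
```

In an application, you would build one TenantIndex per request from the authenticated tenant id and never hand the raw index to request handlers.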

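Hybrid search

For keyword-heavy queries, mix sparse and dense vectors. Sparse vectors are token-frequency style, such as BM25 weights; dense vectors are embeddings. A single Pinecone index can hold both, provided it was created with metric="dotproduct". Pinecone scores each match on the dense and sparse parts together, and you control the balance by weighting the two vectors before you query.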

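# Needs an index created with metric="dotproduct"; the cosine index above will not accept sparse values.
res = index.query(
    vector=qvec,
    sparse_vector={"indices": [10, 45, 102], "values": [0.5, 0.3, 0.2]},
    top_k=5,
    namespace="default",
)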

Hybrid wins on queries with rare proper nouns, product SKUs, or code identifiers that embeddings smear together.
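
One common way to control how much each signal counts is to scale both query vectors by a weight before sending them. The sketch below assumes a convex combination with a parameter alpha, where 1.0 means dense only and 0.0 means sparse only; the function name hybrid_scale is hypothetical, and the right alpha is something to tune on your own queries.

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """Weight the dense vector by alpha and the sparse values by (1 - alpha).

    `sparse` uses the {"indices": [...], "values": [...]} shape from the query above.
    Returns new objects; the inputs are left unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse

toy_dense = [1.0, 2.0]
toy_sparse = {"indices": [10, 45, 102], "values": [0.5, 0.3, 0.2]}

# Equal weight on both signals.
half_dense, half_sparse = hybrid_scale(toy_dense, toy_sparse, alpha=0.5)

# alpha=1.0 zeroes the sparse part, which reduces to a pure dense query.
dense_only, zeroed_sparse = hybrid_scale(toy_dense, toy_sparse, alpha=1.0)
```

The scaled pair then goes into index.query as vector and sparse_vector. Leaning toward the sparse side tends to help when the query is mostly identifiers or SKUs.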

Updates and deletes

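# Merge new fields into d1's metadata; the stored vector is untouched.
index.update(id="d1", set_metadata={"reviewed": True}, namespace="default")
# Delete specific records by id.
index.delete(ids=["d2"], namespace="default")
# Delete every record matching a metadata filter (check that your index type supports this).
index.delete(filter={"team": {"$eq": "growth"}}, namespace="default")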

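To re-embed a record, upsert with the same id and the new vector; the old values are overwritten, with no migration step. The exception is switching to a model with a different dimension: an index's dimension is fixed at creation, so that change needs a new index.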

Wiring it into a RAG endpoint

def retrieve(q: str, k: int = 4) -> list[str]:
    v = oai.embeddings.create(model="text-embedding-3-small", input=q).data[0].embedding
    r = index.query(vector=v, top_k=k, namespace="default", include_metadata=True)
    # Skip records with no "text" field, such as the filter-demo vectors p1 and p2.
    return [m.metadata["text"] for m in r.matches if m.metadata and "text" in m.metadata]

def answer(q: str) -> str:
    ctx = "\n".join(retrieve(q))
    msg = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using the context. If unknown say so."},
            {"role": "user", "content": f"Context:\n{ctx}\n\nQ: {q}"},
        ],
    )
    return msg.choices[0].message.content

print(answer("Which vector DB lives inside Postgres?"))

Front this with FastAPI; see What is FastAPI.

Tuning that matters

Use a smaller embedding when latency rules, a larger one when quality rules. Set top_k based on how much context your prompt budget allows. Chunk source documents to roughly 200 to 500 tokens. Re-rank the top 20 with a stronger model if precision matters.
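
Chunking is the piece most people end up writing by hand. Here is a minimal sketch that counts whitespace-separated words as a crude stand-in for tokens; the function name chunk_words, the 300-word default, and the overlap between neighboring chunks are all assumptions added for illustration. Swap in a real tokenizer if you need exact token budgets.

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split `text` into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so a sentence cut at a boundary
    still appears whole in at least one chunk.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a chunk has reached the end, to avoid a redundant tail chunk.
        if start + max_words >= len(words):
            break
    return chunks

# Ten "words" with a window of 4 and an overlap of 1.
sample = " ".join(str(i) for i in range(10))
sample_chunks = chunk_words(sample, max_words=4, overlap=1)
```

Each chunk becomes one record: embed it, give it an id such as the document id plus a chunk number, and upsert it with the chunk text in metadata.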

Wrap up

Pinecone is boring in the best way. Pick a dimension, upsert in batches, query with filters, isolate tenants with namespaces, and let the managed plane handle the rest.