
Generating Text Embeddings with an API

A practical tutorial on text embeddings: vector intuition, calling an embeddings API, choosing dimensions, computing cosine similarity, and caching the results.

7 min read · By Yash Kesharwani
Intermediate 9 min read

What you'll learn

  • A clear mental model of what a text embedding actually is
  • How to call an embeddings API and batch requests efficiently
  • How to pick embedding dimensions and what tradeoffs they imply
  • How to compute cosine similarity correctly in JS and Python
  • How to cache embeddings so you do not pay twice

Prerequisites

  • A quick read of [embeddings explained](/blog/rag-embeddings-explained)
  • Comfort with HTTP APIs and basic linear algebra

Embeddings are the lingua franca of modern retrieval. They turn arbitrary text into fixed-length vectors so that similarity in meaning becomes similarity in geometry. If you understand five concepts — vectors, the embedding call, dimensions, cosine similarity, and caching — you can build search, recommendations, deduplication, and the retrieval half of a RAG system. This tutorial walks through each in code.

The mental model

An embedding model is a function f: text -> R^d. Give it a string, get back an array of d floating-point numbers. The model is trained so that texts with similar meaning land near each other in that d-dimensional space.

You do not need to know how the model produces the vector. You only need three properties:

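  1. Deterministic for the same model and input, or very nearly so: some hosted APIs return values that differ slightly in the low-order digits between calls, and a model version change gives you a different vector space.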
  2. Stable in length. Every output has the same dimension.
  3. Comparable by a distance metric, usually cosine similarity.

A useful intuition: imagine projecting every sentence in your corpus onto a high-dimensional sphere. Sentences about “Python virtual environments” cluster on one patch; sentences about “Linux file permissions” cluster on another. To find relevant docs for a query, you embed the query and look at the patch it lands in.
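
To make the "patches" picture concrete, here is a toy sketch with hand-made 3-dimensional vectors. The numbers are invented purely for illustration; real embeddings come from a model and have hundreds or thousands of dimensions. The helper names `cosine` and `nearest` are hypothetical.

```python
import math

# Hand-made 3-d "embeddings", invented purely for illustration.
TOY_VECTORS = {
    "create a Python virtual environment": [0.9, 0.1, 0.1],
    "activate a venv before installing packages": [0.8, 0.2, 0.1],
    "change file permissions with chmod": [0.1, 0.9, 0.2],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec: list[float], vectors: dict[str, list[float]]) -> str:
    """Return the text whose vector points in the most similar direction."""
    return max(vectors, key=lambda text: cosine(query_vec, vectors[text]))

# Pretend this is the embedding of "how do I set up a venv?"
query = [0.95, 0.1, 0.1]
print(nearest(query, TOY_VECTORS))
```

The query lands in the "virtual environment" patch, far from the chmod sentence, which is all that retrieval needs.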

Calling an embeddings API

Most providers expose a single endpoint. The contract is “send strings, receive vectors.” A representative request:

curl https://api.example.com/v1/embeddings \
  -H "authorization: Bearer $API_KEY" \
  -H "content-type: application/json" \
  -d '{
    "model": "text-embed-3",
    "input": ["how do I list files in a directory?"]
  }'

Response:

{
  "model": "text-embed-3",
  "data": [
    { "index": 0, "embedding": [0.0123, -0.0456, ...] }
  ],
  "usage": { "input_tokens": 9 }
}

In Python:

import os
from openai import OpenAI  # works against any OpenAI-compatible endpoint

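# base_url points the client at the OpenAI-compatible endpoint from the curl example
client = OpenAI(api_key=os.environ["API_KEY"], base_url="https://api.example.com/v1")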

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embed-3",
        input=texts,
    )
    # Sort by index because providers do not guarantee order
    return [d.embedding for d in sorted(resp.data, key=lambda x: x.index)]

In JavaScript:

async function embed(texts) {
  const res = await fetch('https://api.example.com/v1/embeddings', {
    method: 'POST',
    headers: {
      authorization: `Bearer ${process.env.API_KEY}`,
      'content-type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embed-3', input: texts }),
  });
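  // Fail loudly on HTTP errors instead of trying to parse an error body as embeddings
  if (!res.ok) throw new Error(`Embeddings request failed: ${res.status}`);
  const json = await res.json();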
  json.data.sort((a, b) => a.index - b.index);
  return json.data.map(d => d.embedding);
}

Always batch. Most APIs accept hundreds or thousands of inputs per call and the per-request overhead is significant. A common pattern is to chunk your corpus into batches of 64-256 strings and run a small concurrency pool of 4-8 requests.
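
The batching pattern can be sketched roughly as follows. This is a minimal sketch, not a production client: `embed_fn` stands for any function that takes a list of strings and returns one vector per string (such as the `embed()` shown earlier), and the names `chunked` and `embed_corpus`, along with the default batch size and worker count, are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Vector = list[float]

def chunked(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_corpus(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[Vector]],
    batch_size: int = 128,
    max_workers: int = 4,
) -> list[Vector]:
    """Embed texts in batches, keeping a few requests in flight at once."""
    batches = chunked(texts, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in submission order, so the output
        # lines up with `texts` even if requests finish out of order.
        results = pool.map(embed_fn, batches)
        return [vec for batch in results for vec in batch]
```

Threads are a reasonable fit here because the work is network-bound; an async client with a semaphore would be an equally valid choice.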

Dimensions and the cost-quality tradeoff

Modern embedding models often let you request a smaller dimension than the model’s native size. A model trained at 1536 dimensions may let you ask for 768 or 256 with a dimensions parameter. Smaller vectors mean:

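  • Cheaper storage. A million 1536-dim float32 vectors take about 6 GB (1,000,000 × 1536 × 4 bytes). At 256 dims, the same million vectors take about 1 GB.
  • Faster search. The cost of each distance computation grows linearly with d, so brute-force scans and most approximate nearest neighbor (ANN) indexes get faster as d shrinks.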
  • Slightly lower quality. Recall on hard queries drops, but often less than you would expect.
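
The storage figures are simple arithmetic: number of vectors times dimensions times bytes per value. A small sketch (the function name `index_size_gb` is hypothetical, and it counts raw vector bytes only, ignoring index overhead and metadata):

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in gigabytes (10^9 bytes); float32 is 4 bytes per value."""
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (1536, 768, 256):
    print(dims, round(index_size_gb(1_000_000, dims), 2))
# 1536 6.14
# 768 3.07
# 256 1.02
```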

A reasonable approach: start at the model’s default, build the pipeline end to end, then run an eval (see LLM evaluation basics) at 1536, 768, and 256 and pick the smallest that meets your quality bar.
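
One way to frame that comparison, as a sketch: measure recall@k for each dimension on a labeled query set, then take the smallest dimension that clears your bar. Both function names are hypothetical, recall@k is just one possible metric, and the numbers in the usage lines are placeholders, not measurements.

```python
from typing import Optional

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def smallest_dim_meeting_bar(recall_by_dim: dict[int, float], bar: float) -> Optional[int]:
    """Return the smallest dimension whose recall is at least `bar`, if any."""
    passing = [dim for dim, recall in recall_by_dim.items() if recall >= bar]
    return min(passing) if passing else None

# Placeholder numbers for illustration only, not measurements.
example = {1536: 0.92, 768: 0.91, 256: 0.84}
print(smallest_dim_meeting_bar(example, bar=0.90))  # 768
```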

You can also store vectors as float16 or even int8 with quantization. The accuracy loss is usually negligible for retrieval, and storage halves (float16) or quarters (int8) relative to float32.
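
To show what int8 quantization means mechanically, here is a pure-Python sketch of one simple scheme: symmetric, per-vector scaling into the range [-127, 127]. This is an assumption for illustration; vector databases use a variety of schemes, and in practice you would likely work on NumPy arrays (for example `astype(np.float16)` for the half-precision case). The function names are hypothetical.

```python
import math

def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Map each value to an integer in [-127, 127]; return the codes and the scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

original = [0.12, -0.45, 0.33, 0.07, -0.81]
codes, scale = quantize_int8(original)
restored = dequantize_int8(codes, scale)
# The restored vector should point in almost the same direction as the original.
print(codes, cosine(original, restored))
```

Each code fits in one byte instead of four, at the cost of storing one extra scale value per vector.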

Cosine similarity, done correctly

Cosine similarity between two vectors a and b is the dot product divided by the product of their magnitudes:

cos(a, b) = (a · b) / (||a|| * ||b||)

It ranges from -1 (opposite) to 1 (identical direction). For embeddings from the same model, values typically fall in a narrow positive band; a “similar” pair might score 0.7 while “unrelated” pairs sit around 0.2.

In NumPy:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectorized over a matrix of candidates
def cosine_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return c @ q

In plain JavaScript:

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

A trick worth knowing: if you normalize vectors once at write time (divide by their L2 norm), cosine similarity reduces to a plain dot product at query time. Most vector databases do this for you, but if you are rolling your own, store normalized vectors and skip the divisions on every search.
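
A quick sketch of why this works, using only the standard library (the helper names are hypothetical): once both vectors have unit length, the denominator of the cosine formula is 1, so only the dot product remains.

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length. Do this once, at write time."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
stored_a, stored_b = l2_normalize(a), l2_normalize(b)

# Full cosine on the raw vectors vs. a plain dot product on the stored ones.
print(cosine(a, b), dot(stored_a, stored_b))
```

Both values come out to 24/25 = 0.96, up to floating-point rounding.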

Caching embeddings

Embeddings are deterministic for the same (model, input) pair, which makes them perfect candidates for caching. A simple disk cache keyed on a hash of the model name and the input text can cut your embedding bill by an order of magnitude during development.

import hashlib, json, sqlite3

def cache_key(model: str, text: str) -> str:
    h = hashlib.sha256()
    h.update(model.encode())
    h.update(b"\x00")
    h.update(text.encode())
    return h.hexdigest()

class EmbeddingCache:
    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS emb (k TEXT PRIMARY KEY, v BLOB)"
        )

    def get_many(self, keys: list[str]) -> dict[str, list[float]]:
        rows = self.db.execute(
            f"SELECT k, v FROM emb WHERE k IN ({','.join('?'*len(keys))})",
            keys,
        ).fetchall()
        return {k: json.loads(v) for k, v in rows}

    def put_many(self, items: dict[str, list[float]]) -> None:
        self.db.executemany(
            "INSERT OR REPLACE INTO emb VALUES (?, ?)",
            [(k, json.dumps(v)) for k, v in items.items()],
        )
        self.db.commit()

Wrap your embed() function so it computes only the cache misses, then merges results in order. For a production system the cache becomes part of your vector database: the database is the cache, and rebuilds re-embed only changed documents.
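
A minimal sketch of such a wrapper, under a few assumptions: `embed_fn` is any function that maps a list of strings to a list of vectors, the cache object exposes `get_many` and `put_many` like the class above, and `DictCache` is a hypothetical in-memory stand-in so the example runs without a database file. `embed_with_cache` is likewise a name chosen for illustration.

```python
import hashlib
from typing import Callable

Vector = list[float]

def cache_key(model: str, text: str) -> str:
    h = hashlib.sha256()
    h.update(model.encode())
    h.update(b"\x00")
    h.update(text.encode())
    return h.hexdigest()

class DictCache:
    """In-memory stand-in with the same get_many/put_many shape as a disk cache."""
    def __init__(self) -> None:
        self.store: dict[str, Vector] = {}

    def get_many(self, keys: list[str]) -> dict[str, Vector]:
        return {k: self.store[k] for k in keys if k in self.store}

    def put_many(self, items: dict[str, Vector]) -> None:
        self.store.update(items)

def embed_with_cache(
    texts: list[str],
    model: str,
    cache: DictCache,
    embed_fn: Callable[[list[str]], list[Vector]],
) -> list[Vector]:
    """Embed only the cache misses, then return vectors in input order."""
    keys = [cache_key(model, t) for t in texts]
    found = cache.get_many(keys)

    # Collect misses once each, preserving first-seen order.
    missing: dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in found and key not in missing:
            missing[key] = text

    if missing:
        vectors = embed_fn(list(missing.values()))
        fresh = dict(zip(missing.keys(), vectors))
        cache.put_many(fresh)
        found.update(fresh)

    return [found[k] for k in keys]
```

Deduplicating the misses matters more than it looks: repeated strings within one batch would otherwise be sent, and billed, more than once.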

Two cache invariants to enforce:

  • Key on the model name. When you upgrade models, the vector space changes and old vectors are incomparable to new ones. Different model strings must produce different cache keys.
  • Normalize the input. Decide once whether you embed raw text, lowercased text, or text with collapsed whitespace, and apply that normalization before hashing. Otherwise “Hello” and “hello” pay twice.
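
Both invariants can be sketched in a few lines. The specific normalization here (Unicode NFC, collapsed whitespace, optional lowercasing) is only one possible policy and `normalize_text` is a hypothetical helper; what matters is that you pick one policy and send the same normalized string to both the hash and the embedding call.

```python
import hashlib
import unicodedata

def normalize_text(text: str, lowercase: bool = False) -> str:
    """Apply one fixed normalization policy before hashing and embedding."""
    text = unicodedata.normalize("NFC", text)
    text = " ".join(text.split())  # collapse whitespace runs and trim the ends
    return text.lower() if lowercase else text

def cache_key(model: str, text: str) -> str:
    """Hash the model name and the text together, separated by a NUL byte."""
    h = hashlib.sha256()
    h.update(model.encode())
    h.update(b"\x00")
    h.update(text.encode())
    return h.hexdigest()

a = cache_key("text-embed-3", normalize_text("Hello  world\n", lowercase=True))
b = cache_key("text-embed-3", normalize_text("hello world", lowercase=True))
print(a == b)  # True: both inputs normalize to the same string

# A different model string yields a different key for the same text.
print(cache_key("text-embed-3", "hello") == cache_key("text-embed-4", "hello"))  # False
```

Lowercasing is off by default in this sketch because case can carry meaning for some models and corpora; whether to enable it is a judgment call for your data.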

Putting it together

A minimal flow for a search index:

  1. Chunk your documents.
  2. For each chunk, compute or look up its embedding.
  3. Store normalized vectors in a database, alongside the chunk text and metadata.
  4. At query time, embed the query, run a top-k similarity search, and return the chunks.

Step 2, plus the query embedding in step 4, is the embeddings API; step 3 and the similarity search in step 4 are the vector database. The hard part of a real system is rarely the embedding call. It is the chunking and evaluation around it.
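
The four steps can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` is a hypothetical word-bucketing function used so the example runs offline (it is not semantic and would be replaced by a real embeddings call), the "database" is a list held in memory, and search is a brute-force scan.

```python
import math
from typing import Callable

Vector = list[float]
EmbedFn = Callable[[list[str]], list[Vector]]

def toy_embed(texts: list[str], dims: int = 64) -> list[Vector]:
    """Stand-in for an embeddings API: counts words into hashed buckets."""
    vectors = []
    for text in texts:
        vec = [0.0] * dims
        for word in text.lower().split():
            vec[sum(ord(c) for c in word) % dims] += 1.0
        vectors.append(vec)
    return vectors

def chunk_words(text: str, size: int = 50) -> list[str]:
    """Step 1: split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def l2_normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def build_index(chunks: list[str], embed_fn: EmbedFn) -> list[tuple[Vector, str]]:
    """Steps 2 and 3: embed each chunk and store the normalized vector with its text."""
    return [(l2_normalize(v), c) for v, c in zip(embed_fn(chunks), chunks)]

def search(query: str, index: list[tuple[Vector, str]], embed_fn: EmbedFn, k: int = 3):
    """Step 4: embed the query and return the top-k chunks by dot product."""
    q = l2_normalize(embed_fn([query])[0])
    scored = [(sum(a * b for a, b in zip(q, v)), text) for v, text in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

docs = [
    "list files in a directory with ls",
    "create a python virtual environment",
    "change file permissions with chmod",
]
chunks = [c for d in docs for c in chunk_words(d)]
index = build_index(chunks, toy_embed)
for score, text in search("python virtual environment", index, toy_embed, k=1):
    print(round(score, 3), text)
```

Swapping `toy_embed` for a real embedding function and the list for a vector database changes the quality and scale, not the shape of the flow.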

Wrap up

Embeddings turn text into vectors, vectors into geometry, and geometry into search. The API is a single endpoint, the math is one formula, and the engineering is mostly about batching, caching, and picking a sensible dimension. Build that pipeline once and the same primitive powers search, recommendations, clustering, and RAG.