RAG Tracing with LangSmith Tutorial

Use LangSmith to trace, debug, and evaluate RAG pipelines step by step, from instrumentation to dataset replay and regression detection.

By Codeloom · Intermediate

What you'll learn

  • Why RAG needs trace-level observability
  • Core LangSmith concepts: runs, traces, datasets
  • How to instrument a Python RAG app
  • How to debug retrieval failures with traces
  • How to set up a regression evaluation loop

Prerequisites

  • Have run a basic RAG pipeline before

What and Why

A RAG pipeline is a chain of small black boxes: query rewriting, retrieval, reranking, prompt building, and generation. When the final answer is wrong, you cannot tell which box broke without looking inside each one. Print statements scale to about ten users and zero engineers.

LangSmith is a tracing and evaluation platform that records every step of an LLM application as a structured run tree. You see exact prompts, retrieved chunks, latencies, and token counts for each call, and you can replay them against new versions of your code.

Mental Model

Every operation in a LangSmith trace is a “run”. Runs nest into a tree that mirrors your call stack, so a top-level chain run contains retriever, prompt, and LLM child runs. Each run records inputs, outputs, errors, timing, and metadata.
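To make the tree shape concrete, here is a minimal sketch in plain Python. The `Run` dataclass and the `slowest_leaf` helper are hypothetical stand-ins for illustration, not classes from the LangSmith SDK, and the latencies are made-up numbers.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Run:
    """Hypothetical stand-in for one node in a trace tree (not the SDK's class)."""
    name: str
    run_type: str
    latency_ms: float
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    children: list["Run"] = field(default_factory=list)

def slowest_leaf(run: Run) -> Run:
    """Walk the tree and return the leaf run with the highest latency."""
    if not run.children:
        return run
    return max((slowest_leaf(c) for c in run.children), key=lambda r: r.latency_ms)

# A top-level chain run with three child runs, mirroring the call stack.
trace = Run("rag", "chain", 1800, children=[
    Run("retrieve", "retriever", 120),
    Run("build_prompt", "tool", 4),
    Run("llm.invoke", "llm", 1600),
])
print(slowest_leaf(trace).name)  # llm.invoke
```

Because every node carries the same fields, questions like "which step is slow?" or "which step raised?" become simple tree walks.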

Runs are grouped into “projects”, typically one per environment, and selected runs can be saved as examples in “datasets” for evaluation. A dataset is a pinned set of inputs with optional reference outputs, and you run your code over the dataset to compare versions over time.
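The dataset idea can be sketched without the SDK. In the example below, the dataset layout, the substring check, and the two stub pipelines are all assumptions made for illustration; a real setup would store examples in LangSmith and call your actual pipeline.

```python
from typing import Callable

# Hypothetical dataset: pinned inputs with optional reference outputs.
dataset = [
    {"input": "how do I rotate IAM keys?", "reference": "rotate"},
    {"input": "what is a run?", "reference": None},  # no reference: not scored
]

def run_over_dataset(pipeline: Callable[[str], str], examples: list[dict]) -> float:
    """Return the fraction of referenced examples whose output contains the reference."""
    scored = [ex for ex in examples if ex["reference"] is not None]
    hits = sum(ex["reference"].lower() in pipeline(ex["input"]).lower() for ex in scored)
    return hits / len(scored) if scored else 0.0

def pipeline_v1(query: str) -> str:
    return "I don't know."  # stub standing in for an older version

def pipeline_v2(query: str) -> str:
    return "Rotate the keys, then disable the old ones."  # stub for a newer version

print(run_over_dataset(pipeline_v1, dataset))  # 0.0
print(run_over_dataset(pipeline_v2, dataset))  # 1.0
```

The inputs stay fixed while the code changes, which is what makes scores comparable across versions.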

Hands-on Example

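Install the SDK, then configure tracing with environment variables: LANGSMITH_TRACING turns tracing on, LANGSMITH_API_KEY authenticates, and LANGSMITH_PROJECT names the project that receives the traces.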

pip install langsmith langchain
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=ls_...
export LANGSMITH_PROJECT=rag-prod

Wrap your retrieval and generation functions with the traceable decorator. Any traced call made inside, such as another traceable function or a LangChain model, nests under its caller automatically.

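from langsmith import traceable

# vector_db, embed, build_prompt, and llm are placeholders for your own
# vector store, embedding function, prompt builder, and LangChain chat model.
# Decorate build_prompt with @traceable(run_type="tool") as well if you want
# it to appear as its own run.

@traceable(run_type="retriever")
def retrieve(query):
    # Fetch the five nearest chunks for the query embedding.
    return vector_db.search(embed(query), k=5)

@traceable(run_type="chain")
def rag(query):
    chunks = retrieve(query)
    prompt = build_prompt(query, chunks)
    # With tracing enabled, a LangChain model call nests under this run.
    return llm.invoke(prompt)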

Each call to rag now shows up as a trace tree in the LangSmith UI.

rag (chain) - 1.8s
|
+-- retrieve (retriever) - 120ms
|     input:  "how do I rotate IAM keys?"
|     output: [chunk_a, chunk_b, chunk_c, chunk_d, chunk_e]
|
+-- build_prompt (tool) - 4ms
|     output: "Use the following context..."
|
+-- llm.invoke (llm) - 1.6s
      tokens_in: 1820  tokens_out: 142
      output: "Rotate IAM access keys via..."

Figure: A LangSmith trace tree mirrors the structure of a RAG call.

To debug a bad answer, open the trace, expand the retriever run, and check the chunks. More often than not, the answer is wrong because retrieval did not return the right chunk, not because the LLM reasoned badly.

Trade-offs

Tracing every request gives full visibility but costs storage and a little overhead. The LangSmith SDK batches uploads in the background, so user-facing latency stays low, but you do pay for each trace at scale. At high volume, use sampling: for example, keep a fixed percentage of production traces plus all error traces.
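The sampling rule can be sketched as a small decision function. This is an illustration of the policy only: `should_keep_trace` is a hypothetical helper, not an SDK call, and it assumes the decision is made once the request's outcome is known, so error traces can always be kept.

```python
import hashlib

def should_keep_trace(trace_id: str, had_error: bool, sample_rate: float = 0.1) -> bool:
    """Keep every error trace, plus a deterministic fraction of the rest."""
    if had_error:
        return True
    # Hash the ID into [0, 1) so the same trace always gets the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the trace ID rather than calling a random number generator keeps the decision stable across retries and across services that see the same ID.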

Hosted LangSmith is the fastest path but sends prompts and outputs off your infrastructure. For regulated data, you need self-hosted LangSmith or careful payload redaction.

Compared to building your own logging, LangSmith gives you a UI, dataset versioning, and evaluators out of the box. The downside is vendor lock-in: the traceable decorator and dataset format are LangSmith-specific.

Practical Tips

Tag runs with stable metadata like version, model, and tenant. Filtering by these in the UI is how you compare deployments.

Save bad traces to a dataset directly from the UI. Over a few weeks you build a real regression suite from actual user failures rather than synthetic prompts.

Add evaluators that score retrieval quality separately from answer quality. A simple one: does the reference answer text appear in any retrieved chunk? Run it on every dataset run and watch the trend.
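The simple retrieval check described above might look like the sketch below. The function name, the returned dictionary shape, and the whitespace normalization are assumptions; adapt the signature to whatever evaluator interface your evaluation runner expects.

```python
def reference_in_chunks(chunks: list[str], reference: str) -> dict:
    """Score 1.0 if the reference text appears in any retrieved chunk, else 0.0."""
    # Normalize case and whitespace so formatting differences do not cause misses.
    needle = " ".join(reference.lower().split())
    # An empty reference would match everything, so treat it as a miss.
    hit = bool(needle) and any(
        needle in " ".join(chunk.lower().split()) for chunk in chunks
    )
    return {"key": "reference_in_chunks", "score": 1.0 if hit else 0.0}
```

Exact substring matching is strict: it only fits datasets whose reference answers are short spans copied from the source documents. For paraphrased references, a looser similarity measure would be needed.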

Use Feedback objects to attach human ratings or production thumbs-up and thumbs-down signals to runs. These become labels for offline evaluation later.
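As a sketch, a thumbs signal can be recorded through the client's `create_feedback` method. The helper name `record_thumb` and the feedback key `user_thumb` are hypothetical; the client is passed in so the function is easy to test with a fake.

```python
from typing import Optional

def record_thumb(client, run_id: str, thumbs_up: bool, comment: Optional[str] = None) -> None:
    """Attach a thumbs up/down rating to an existing run as feedback."""
    client.create_feedback(
        run_id,
        key="user_thumb",  # hypothetical key; pick one and keep it stable
        score=1.0 if thumbs_up else 0.0,
        comment=comment,
    )

# Usage, assuming a configured API key and a run ID captured at request time:
#   from langsmith import Client
#   record_thumb(Client(), run_id, thumbs_up=True)
```

Keeping the key stable matters because feedback is typically filtered and aggregated by key.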

Redact PII before it enters a trace. Wrap inputs with a scrubber so personal data never leaves your service, or run a self-hosted instance.
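A scrubber can be as simple as the sketch below. The two regular expressions are deliberately rough assumptions that catch common email and phone formats only; real PII redaction usually needs a broader rule set or a dedicated library.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(value):
    """Recursively replace emails and phone-like numbers in strings, dicts, and lists."""
    if isinstance(value, str):
        value = EMAIL_RE.sub("[EMAIL]", value)
        return PHONE_RE.sub("[PHONE]", value)
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(v) for v in value]
    return value  # numbers, None, and other types pass through unchanged
```

Since `scrub` maps a dict to a dict, it should fit wherever the SDK accepts an input or output transformation, such as the `hide_inputs` and `hide_outputs` options on the client. Check the current SDK documentation for the exact hook you want to use.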

Pin a baseline run for every dataset and compare each new run against it, so a model swap or prompt edit that drops accuracy shows up immediately rather than weeks later.
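The baseline comparison itself is simple arithmetic, sketched here in plain Python. The metric names, the scores, and the 0.02 tolerance are made-up illustration values, and `find_regressions` is a hypothetical helper rather than part of LangSmith.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the delta."""
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in candidate:
            continue  # metric missing from the new run: nothing to compare
        delta = candidate[metric] - base_score
        if delta < -tolerance:
            regressions[metric] = round(delta, 4)
    return regressions

baseline = {"retrieval_hit_rate": 0.90, "answer_correct": 0.80}
candidate = {"retrieval_hit_rate": 0.91, "answer_correct": 0.70}
print(find_regressions(baseline, candidate))  # {'answer_correct': -0.1}
```

A small tolerance avoids flagging run-to-run noise. A check like this can run in CI and fail the build when the returned dictionary is not empty.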

Wrap-up

LangSmith turns a RAG pipeline from a guess machine into a debuggable system. Instrument with traceable, look at trace trees when answers go wrong, save failure cases into datasets, and run evaluators to catch regressions. The setup takes an afternoon and pays for itself the first time you ship a prompt change without breaking production.