Skip to content
C Codeloom
LLMs

Using the Anthropic Claude SDK in Python

A practical tour of the Anthropic Python SDK: messages, streaming, tool use, prompt caching, and patterns for shipping Claude in production.

·4 min read · By Yash Kesharwani
Intermediate 10 min read

What you'll learn

  • Call Claude with the Messages API
  • Stream tokens with the recommended helper
  • Define tools and run a tool loop
  • Cut cost with prompt caching
  • Handle thinking blocks and stop reasons

Prerequisites

  • Python 3.10 plus pip
  • Read [What is an LLM](/blog/what-is-an-llm)
  • Optional: [LLM Tool Use and Function Calling](/blog/llm-tool-use-and-function-calling)

The Anthropic SDK is small on purpose. One client, one Messages API, a few helpers. If you have used the OpenAI SDK you will feel at home, but there are real differences in tool use, system prompts, and caching.

Setup

pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
import anthropic
client = anthropic.Anthropic()

A first call

System prompts are a top-level field, not a role inside messages. This is the most common gotcha when porting from OpenAI.

msg = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system="You are concise and technical.",
    messages=[
        {"role": "user", "content": "Explain Raft in one sentence."},
    ],
)
print(msg.content[0].text)

max_tokens is required. Set it to the largest response you would actually accept.

Streaming

Use the stream helper. It handles event types and gives you a tidy iterator of text deltas.

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=512,
    messages=[{"role": "user", "content": "Count to five."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()
print("\nstop:", final.stop_reason)

stop_reason tells you whether the model ended naturally, hit max_tokens, or wants to call a tool.

Tool use

Tools are JSON schemas with a name, description, and input shape. Claude will respond with a tool_use content block when it wants to call one.

tools = [{
    "name": "get_stock_price",
    "description": "Return the current stock price.",
    "input_schema": {
        "type": "object",
        "properties": {"ticker": {"type": "string"}},
        "required": ["ticker"],
    },
}]

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "Price of NVDA?"}],
)
print(resp.stop_reason)  # "tool_use"
for block in resp.content:
    if block.type == "tool_use":
        print(block.name, block.input)

The model can request several tools in parallel; iterate every tool_use block.

Completing the tool loop

You execute the tool, then send a user message containing a tool_result block keyed to the original tool_use_id.

def get_stock_price(ticker: str) -> str:
    return f"{ticker} is at 142.10"

messages = [{"role": "user", "content": "Price of NVDA?"}]
resp = client.messages.create(
    model="claude-opus-4-7", max_tokens=512, tools=tools, messages=messages,
)
messages.append({"role": "assistant", "content": resp.content})

tool_results = []
for block in resp.content:
    if block.type == "tool_use":
        result = get_stock_price(**block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })

messages.append({"role": "user", "content": tool_results})

final = client.messages.create(
    model="claude-opus-4-7", max_tokens=512, tools=tools, messages=messages,
)
print(final.content[0].text)

That is the canonical agent step. Loop until stop_reason is not tool_use.

Prompt caching

Anthropic supports cache breakpoints on the system prompt and on tools. When the prefix is reused, you pay a fraction of the input cost and get faster responses. This is the single highest-leverage optimization in the SDK.

client.messages.create(
    model="claude-opus-4-7",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You are a code reviewer."},
        {"type": "text", "text": LARGE_STYLE_GUIDE, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Review this PR..."}],
)

The first call writes the cache, subsequent calls hit it. Put stable content before volatile content; cache hits require a matching prefix.

Extended thinking

Recent Claude models can use a thinking budget for harder problems. The thinking text comes back as a thinking block; do not feed it back as user content.

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)
for b in resp.content:
    if b.type == "text":
        print(b.text)

Async client and retries

from anthropic import AsyncAnthropic
import asyncio

aclient = AsyncAnthropic(max_retries=3, timeout=30.0)

async def ask(q: str) -> str:
    r = await aclient.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        messages=[{"role": "user", "content": q}],
    )
    return r.content[0].text

print(asyncio.run(ask("What is REST?")))

For an HTTP example, pair this with What is FastAPI and stream the response back to the client.

Wrap up

Claude rewards careful prompt structure: stable content first, cache it, then volatile content, then tools. Master the tool loop, cache aggressively, and the SDK feels less like an API and more like a primitive.