Using the Anthropic Claude SDK in Python
A practical tour of the Anthropic Python SDK: messages, streaming, tool use, prompt caching, and patterns for shipping Claude in production.
What you'll learn
- ✓Call Claude with the Messages API
- ✓Stream tokens with the recommended helper
- ✓Define tools and run a tool loop
- ✓Cut cost with prompt caching
- ✓Handle thinking blocks and stop reasons
Prerequisites
- •Python 3.10 plus pip
- •Read [What is an LLM](/blog/what-is-an-llm)
- •Optional: [LLM Tool Use and Function Calling](/blog/llm-tool-use-and-function-calling)
The Anthropic SDK is small on purpose. One client, one Messages API, a few helpers. If you have used the OpenAI SDK you will feel at home, but there are real differences in tool use, system prompts, and caching.
Setup
pip install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
import anthropic
client = anthropic.Anthropic()
A first call
System prompts are a top-level field, not a role inside messages. This is the most common gotcha when porting from OpenAI.
msg = client.messages.create(
model="claude-opus-4-7",
max_tokens=512,
system="You are concise and technical.",
messages=[
{"role": "user", "content": "Explain Raft in one sentence."},
],
)
print(msg.content[0].text)
max_tokens is required. Set it to the largest response you would actually accept.
Streaming
Use the stream helper. It handles event types and gives you a tidy iterator of text deltas.
with client.messages.stream(
model="claude-opus-4-7",
max_tokens=512,
messages=[{"role": "user", "content": "Count to five."}],
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
final = stream.get_final_message()
print("\nstop:", final.stop_reason)
stop_reason tells you whether the model ended naturally, hit max_tokens, or wants to call a tool.
Tool use
Tools are JSON schemas with a name, description, and input shape. Claude will respond with a tool_use content block when it wants to call one.
tools = [{
"name": "get_stock_price",
"description": "Return the current stock price.",
"input_schema": {
"type": "object",
"properties": {"ticker": {"type": "string"}},
"required": ["ticker"],
},
}]
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=512,
tools=tools,
messages=[{"role": "user", "content": "Price of NVDA?"}],
)
print(resp.stop_reason) # "tool_use"
for block in resp.content:
if block.type == "tool_use":
print(block.name, block.input)
The model can request several tools in parallel; iterate every tool_use block.
Completing the tool loop
You execute the tool, then send a user message containing a tool_result block keyed to the original tool_use_id.
def get_stock_price(ticker: str) -> str:
return f"{ticker} is at 142.10"
messages = [{"role": "user", "content": "Price of NVDA?"}]
resp = client.messages.create(
model="claude-opus-4-7", max_tokens=512, tools=tools, messages=messages,
)
messages.append({"role": "assistant", "content": resp.content})
tool_results = []
for block in resp.content:
if block.type == "tool_use":
result = get_stock_price(**block.input)
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result,
})
messages.append({"role": "user", "content": tool_results})
final = client.messages.create(
model="claude-opus-4-7", max_tokens=512, tools=tools, messages=messages,
)
print(final.content[0].text)
That is the canonical agent step. Loop until stop_reason is not tool_use.
Prompt caching
Anthropic supports cache breakpoints on the system prompt and on tools. When the prefix is reused, you pay a fraction of the input cost and get faster responses. This is the single highest-leverage optimization in the SDK.
client.messages.create(
model="claude-opus-4-7",
max_tokens=512,
system=[
{"type": "text", "text": "You are a code reviewer."},
{"type": "text", "text": LARGE_STYLE_GUIDE, "cache_control": {"type": "ephemeral"}},
],
messages=[{"role": "user", "content": "Review this PR..."}],
)
The first call writes the cache, subsequent calls hit it. Put stable content before volatile content; cache hits require a matching prefix.
Extended thinking
Recent Claude models can use a thinking budget for harder problems. The thinking text comes back as a thinking block; do not feed it back as user content.
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
thinking={"type": "enabled", "budget_tokens": 1024},
messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)
for b in resp.content:
if b.type == "text":
print(b.text)
Async client and retries
from anthropic import AsyncAnthropic
import asyncio
aclient = AsyncAnthropic(max_retries=3, timeout=30.0)
async def ask(q: str) -> str:
r = await aclient.messages.create(
model="claude-opus-4-7",
max_tokens=256,
messages=[{"role": "user", "content": q}],
)
return r.content[0].text
print(asyncio.run(ask("What is REST?")))
For an HTTP example, pair this with What is FastAPI and stream the response back to the client.
Wrap up
Claude rewards careful prompt structure: stable content first, cache it, then volatile content, then tools. Master the tool loop, cache aggressively, and the SDK feels less like an API and more like a primitive.