LLM Tool Calling and Agents Overview

Intermediate 11 min read

What you'll learn

✓ What tool calling actually is under the hood
✓ How an agent loop reads, runs, and feeds tool results back
✓ How to design tool schemas that the model can use reliably
✓ Common failure modes and how to mitigate them
✓ When you do not need an agent at all

Prerequisites

• Familiarity with how APIs work

What and Why

A bare LLM can only output text. Tool calling is the mechanism that lets it ask your application to run a function and return the result. With tools, the same model can fetch live data, query a database, call another API, or perform a calculation it would otherwise hallucinate.

An agent is a loop on top of tool calling: the model proposes a tool call, your code executes it, you append the result to the conversation, and the model decides whether to call another tool or produce a final answer.

Mental Model

Tool calling is structured output, not magic. You give the model a JSON schema for each tool. The model decides whether to respond with plain text or with a tool_calls field containing the tool name and arguments. Your code is responsible for actually running the tool.

user message
 |
 v
LLM -> "tool_call: search(query='weather Tokyo')"
 |
 v
your code runs search() -> "26C, clear"
 |
 v
append tool result to messages
 |
 v
LLM -> "tool_call: convert_units(...)" OR "final: It is 26 degrees in Tokyo."
 |
 v
 (loop until final answer or max steps)

Agent loop with tool calling

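The model never executes anything itself. It only describes which tool it wants and with what arguments. This separation keeps execution under your control: your code decides whether and how each requested call runs, which is what makes tool calling practical to wire into production systems.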
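To make the "structured output" point concrete, here is a small sketch of what a tool-call response looks like once it is written out as plain data. The message below is hand-written in the OpenAI-style shape rather than captured from a real response, and the `describe` helper and the `call_001` id are made up for illustration.

```python
import json

# Hypothetical assistant message in the OpenAI-style shape.
# Note that "arguments" arrives as a JSON string, not as a dict.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_001",  # made-up id for illustration
        "type": "function",
        "function": {
            "name": "search",
            "arguments": "{\"query\": \"weather Tokyo\"}",
        },
    }],
}

def describe(message):
    """Return ('text', content) or ('tool_calls', [(name, args), ...])."""
    calls = message.get("tool_calls") or []
    if not calls:
        return ("text", message["content"])
    parsed = [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in calls
    ]
    return ("tool_calls", parsed)

kind, payload = describe(assistant_message)
print(kind, payload)
```

Nothing in this message runs `search`. It is only a request, and your code chooses what to do with it.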

Hands-on Example

Here is a minimal agent loop using the OpenAI Python SDK and its Chat Completions API. Other providers expose a similar shape under different field names.

from openai import OpenAI
import json

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    # Pretend this hits a real API.
    return {"city": city, "temp_c": 26, "conditions": "clear"}

def run_agent(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools,
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
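        for call in msg.tool_calls:
            # Dispatch on the tool name; the model may request a tool that does not exist.
            if call.function.name == "get_weather":
                try:
                    args = json.loads(call.function.arguments)
                    result = get_weather(**args)
                except (json.JSONDecodeError, TypeError) as exc:
                    # Malformed JSON or wrong argument names: report it instead of crashing.
                    result = {"error": f"invalid arguments: {exc}"}
            else:
                result = {"error": f"unknown tool: {call.function.name}"}
            # Every tool call needs a matching tool message with the same id.
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })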
    return "Max steps reached."

print(run_agent("What's the weather in Tokyo?"))

The loop is the entire idea of an agent. Everything else (planners, memory, multi-agent setups) is an elaboration on this skeleton.

Trade-offs

Agents look powerful but introduce real costs.

Latency multiplies. Each tool call adds another round trip to the model, and the round trips run one after another. A five-step agent makes at least five LLM calls before it answers.
Cost multiplies. Each call resends the entire conversation, including prior tool results, so the input token count grows with every step. Long chains get expensive quickly.
Failure surface grows. The model can hallucinate tool names, pass wrong types, or get stuck in loops calling the same tool.
Debugging is harder. You now have to inspect a trace of calls, not a single request.
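The cost point is easy to check with rough arithmetic. The sketch below assumes a simplified model in which every step appends the same number of tokens to the conversation; the function name and the token figures are illustrative, not measurements.

```python
def cumulative_input_tokens(base_tokens, tokens_added_per_step, steps):
    """Total input tokens sent across `steps` model calls.

    Call k (0-based) resends the base prompt plus everything the
    previous k steps appended to the conversation.
    """
    return sum(base_tokens + k * tokens_added_per_step for k in range(steps))

print(cumulative_input_tokens(500, 300, 1))  # 500
print(cumulative_input_tokens(500, 300, 5))  # 5500
```

Under these assumptions, five steps cost eleven times the input tokens of a single call, not five times, because each step pays again for everything before it.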

In practice, many “agent” use cases are better served by a fixed workflow that calls tools in a predetermined order. Reach for an open-ended agent only when the branching is genuinely data-dependent.
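As a sketch of what a fixed workflow can look like, the example below always calls the same two tools in the same order, with no model deciding what happens next. Both tool functions are hypothetical stubs standing in for a database query and a carrier API.

```python
def lookup_order(order_id):
    # Stub standing in for a database query (hypothetical).
    return {"order_id": order_id, "tracking_id": "TRK-" + order_id}

def get_shipping_status(tracking_id):
    # Stub standing in for a carrier API call (hypothetical).
    return {"tracking_id": tracking_id, "status": "in transit"}

def order_status_workflow(order_id):
    """Fixed workflow: the same two tools, always in the same order."""
    order = lookup_order(order_id)
    shipping = get_shipping_status(order["tracking_id"])
    return f"Order {order['order_id']} is {shipping['status']}."

print(order_status_workflow("1042"))  # Order 1042 is in transit.
```

If the final wording needs to be more natural, a single LLM call at the end can turn the collected facts into prose. That is still one model call with a predictable cost, not a loop.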

Practical Tips

A few habits keep tool calling reliable.

Write tool descriptions for the model, not for humans. State exactly when to call the tool and what the inputs mean. Bad descriptions cause silent misuse.
Use strict JSON schema. Enable strict mode if your provider supports it so arguments are guaranteed to match the schema. This eliminates a whole class of parse errors.
Return structured, compact results. Tool outputs become input tokens on the next turn. Trim noise. Return JSON, not paragraphs.
Cap the loop. Always set a max_steps. Always log every step. A runaway agent can burn dollars in seconds.
Validate before executing. Even with strict schemas, validate ranges and business rules before you let a tool delete data or send money.
Prefer many small tools over one giant one. A search_orders(filters) tool is easier for the model to use correctly than a do_anything(query) tool.
Surface errors back as tool results. If a tool fails, return a JSON error object. The model can often recover by trying different arguments.
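Several of these tips can be combined in one small dispatcher. The sketch below assumes the OpenAI-style strict function schema (a `strict` flag, every property listed in `required`, and `additionalProperties` set to false); check your provider's documentation for the exact form. The `refund_order` tool, the refund limit, and the `execute_tool_call` helper are all hypothetical.

```python
import json

# Assumed OpenAI-style strict schema for a hypothetical refund tool.
REFUND_TOOL = {
    "type": "function",
    "function": {
        "name": "refund_order",
        "description": "Refund part or all of a paid order. "
                       "Call only after the user has confirmed the amount.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["order_id", "amount_cents"],
            "additionalProperties": False,
        },
    },
}

MAX_REFUND_CENTS = 50_000  # hypothetical business rule

def refund_order(order_id, amount_cents):
    # Stub standing in for a real payment call.
    return {"order_id": order_id, "refunded_cents": amount_cents}

def execute_tool_call(name, raw_arguments):
    """Run one tool call; always return a JSON string for the tool message."""
    if name != "refund_order":
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    order_id = args.get("order_id")
    amount = args.get("amount_cents")
    # Check types and business rules before anything with side effects runs.
    if not isinstance(order_id, str) or not order_id:
        return json.dumps({"error": "order_id must be a non-empty string"})
    if isinstance(amount, bool) or not isinstance(amount, int):
        return json.dumps({"error": "amount_cents must be an integer"})
    if not 0 < amount <= MAX_REFUND_CENTS:
        return json.dumps(
            {"error": f"amount_cents must be between 1 and {MAX_REFUND_CENTS}"}
        )
    return json.dumps(refund_order(order_id, amount))

print(execute_tool_call("refund_order", '{"order_id": "A1", "amount_cents": 1200}'))
print(execute_tool_call("refund_order", '{"order_id": "A1", "amount_cents": 999999}'))
```

Because every path returns a compact JSON string, the result can go straight into the `content` of a tool message, and the model sees a rejected call as an error it can react to.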

Wrap-up

Tool calling is the bridge from text generation to action. Once you can describe a function as a schema, you can extend the model’s reach to anything in your stack. Start with a single tool and a single loop. Add structure as you find real branching in your workflow. Resist the urge to build a generalized agent before you have a concrete problem that demands it.