LLM Tool Use and Function Calling Explained

Intermediate 11 min read

What you'll learn

✓ Why giving an LLM tools changes what it can do
✓ How to define a tool with a JSON schema
✓ The request → tool-call → response loop
✓ Common patterns: search, calculators, database queries
✓ The failure modes that show up in production

Prerequisites

• Comfort with calling an LLM API (see What Is an LLM?)
• Basic JSON knowledge

A bare LLM can write text. That is a lot, but it is also a ceiling. It cannot check today’s stock price, run a SQL query, or send an email. Tool use — sometimes called function calling — is how you punch through that ceiling. You hand the model a menu of callable functions, and it decides when to use them.

This post explains how the mechanism actually works, the patterns that hold up in production, and the failure modes worth knowing before you ship.

Why give an LLM tools

LLMs are good at language and reasoning. They are bad at:

• Exact arithmetic. Token prediction does not multiply seven-digit numbers reliably.
• Fresh facts. Training data has a cutoff.
• Looking things up. Document stores, databases, and APIs sit outside the model.
• Side effects. Sending mail, writing rows, calling external services.

Tools let the model delegate. Instead of guessing the answer, it emits a structured call — “please run get_weather(city='Tokyo')” — and your code executes that. The output flows back into the conversation, and the model continues with real data in hand.

This is the foundation of nearly every useful “AI agent” you have seen.

What a tool actually is

A tool is three things:

• A name the model can reference.
• A description of what it does, in plain English.
• A schema describing its arguments, usually JSON Schema.

The model never executes anything. It only emits a JSON object saying which tool to call and what arguments to pass. Your application reads that, runs the function, and feeds the result back.
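
A minimal sketch of that hand-off, assuming the model's output has already been parsed into a name and an arguments object. The TOOL_REGISTRY dictionary, the run_tool helper, and the stubbed get_weather function are illustrative names, not part of any provider's SDK.

```python
import json

def get_weather(city: str, units: str = "c") -> dict:
    """Stub implementation; a real one would call a weather service."""
    return {"city": city, "temp": 21 if units == "c" else 70, "units": units}

# Hypothetical registry mapping tool names to the Python callables behind them.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool(name: str, arguments: dict) -> str:
    """Execute the named tool and return its result as a JSON string."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": "unknown_tool", "message": f"No tool named {name!r}"})
    try:
        return json.dumps(func(**arguments))
    except TypeError as exc:
        # Typically raised when the model sends missing or unexpected arguments.
        return json.dumps({"error": "bad_arguments", "message": str(exc)})

# The model only produces something like this string; your code does the rest.
model_output = '{"name": "get_weather", "arguments": {"city": "Tokyo"}}'
call = json.loads(model_output)
result = run_tool(call["name"], call["arguments"])
```

Returning errors as data instead of raising keeps the conversation alive: the model sees what went wrong and can try again.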

A typical definition:

# A tool definition the model can see
get_weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city. Use when the user asks about weather conditions.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
            "units": {"type": "string", "enum": ["c", "f"], "default": "c"},
        },
        "required": ["city"],
    },
}

Two things matter here:

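• The description is the prompt the model actually reads to decide whether to call this tool. Treat it as prompt engineering, not API docs.
• The schema guides the JSON the model emits, but it is not always a hard guarantee. Some providers offer a strict or structured-output mode that enforces it; without one, the model can still omit a required field or send a wrong type, so validate arguments in your own code before running anything.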
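
One way to check arguments on your side is a small hand-rolled validator like the sketch below. It covers only required fields, basic types, and enums, so it is not a full JSON Schema implementation, and flagging unexpected arguments is a design choice made here rather than JSON Schema's default behaviour. The validate_arguments name is hypothetical.

```python
weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "units": {"type": "string", "enum": ["c", "f"]},
    },
    "required": ["city"],
}

def validate_arguments(schema, args):
    """Return a list of problems; an empty list means the call looks valid."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = type_map.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"{name} should be of type {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{name} must be one of {spec['enum']}")
    return problems
```

In a real project a dedicated JSON Schema library is likely the better choice; the point is only that the check happens before the tool runs.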

The request → tool-call → response loop

The interaction is not one round trip. It is a small loop.

# Conceptual flow: every provider looks roughly like this.
# `client`, `run_tool`, and the response fields are placeholders, not a real SDK.
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
MAX_STEPS = 10  # cap the loop so a confused model cannot spin forever

for _ in range(MAX_STEPS):
    response = client.chat(model="some-model", messages=messages, tools=[get_weather_tool])

    if response.stop_reason == "tool_use":
        # Model asked to call a tool
        call = response.tool_call
        result = run_tool(call.name, call.arguments)  # your code

        # Feed the call and its result back in
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "name": call.name, "content": result})
        continue

    # No more tool calls: final text answer
    print(response.text)
    break

The loop ends when the model is satisfied and emits plain text. A single user question can trigger zero, one, or many tool calls in sequence. Some models will also emit several tool calls in parallel when they are independent.
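
To see the loop run end to end without an API key, here is a sketch that swaps the real client for a scripted stand-in. ScriptedClient, run_conversation, the dictionary-shaped responses, and the stubbed run_tool are all assumptions made for illustration; real SDKs use their own response objects.

```python
import json

class ScriptedClient:
    """Stand-in for a real LLM client: replays a fixed list of responses."""
    def __init__(self, responses):
        self._responses = list(responses)

    def chat(self, messages, tools):
        return self._responses.pop(0)

def run_tool(name, arguments):
    # Stub: a real implementation would dispatch to actual functions.
    if name == "get_weather":
        return json.dumps({"city": arguments["city"], "temp_c": 21})
    return json.dumps({"error": "unknown_tool"})

def run_conversation(client, user_text, tools, max_steps=5):
    """Drive the tool loop until the model answers in text or the cap is hit."""
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        response = client.chat(messages=messages, tools=tools)
        calls = response.get("tool_calls", [])
        if not calls:
            return response["text"], messages
        messages.append({"role": "assistant", "tool_calls": calls})
        for call in calls:  # handles one call or several parallel calls
            result = run_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
    return "Stopped: tool-call limit reached.", messages

client = ScriptedClient([
    {"tool_calls": [{"name": "get_weather", "arguments": {"city": "Tokyo"}}]},
    {"text": "It is 21 C in Tokyo."},
])
answer, transcript = run_conversation(client, "What's the weather in Tokyo?", tools=[])
```

The transcript ends up holding the user message, the assistant's tool call, and the tool result, which is the context the model sees when it writes the final answer.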

Try it. Pick any LLM API you have access to and define one trivial tool, say multiply(a, b). Ask the model “what is 9871 times 2349?” without the tool, then with it, and compare both answers against the exact product, 23,186,979. The model that knows it can call multiply will usually delegate the arithmetic to the tool, while the bare model may return a plausible-looking but wrong number.

Common patterns

A few tool shapes show up over and over.

Search

{
  "name": "search_docs",
  "description": "Search the product documentation. Use for questions about features, configuration, or APIs.",
  "input_schema": {
    "type": "object",
    "properties": { "query": { "type": "string" } },
    "required": ["query"]
  }
}

Pairs naturally with retrieval — see What Is RAG?. The tool runs a vector or keyword search and returns the top chunks. The model reads them and answers.
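
As a rough sketch of the function behind search_docs, the version below does naive keyword overlap over a tiny in-memory list. The DOCS contents and the scoring are made up for illustration; a production tool would more likely query a vector store or a search index.

```python
import json

# Hypothetical documentation chunks standing in for a real index.
DOCS = [
    {"id": "auth", "text": "Configure API keys in the dashboard under Settings."},
    {"id": "limits", "text": "Rate limits are 60 requests per minute per API key."},
    {"id": "billing", "text": "Invoices are issued on the first day of each month."},
]

def search_docs(query: str, top_k: int = 2) -> str:
    """Return the top_k chunks that share the most words with the query."""
    terms = set(query.lower().split())
    scored = []
    for doc in DOCS:
        words = set(doc["text"].lower().replace(".", "").split())
        score = len(terms & words)
        if score:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return json.dumps({"results": [doc for _, doc in scored[:top_k]]})
```

Capping the output at top_k keeps the tool result small enough to fit comfortably in the context window.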

Calculator

A thin wrapper around safe arithmetic. Cheap to provide, eliminates a whole class of hallucinated numbers.
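
One possible shape for such a wrapper, using Python's ast module to walk the expression instead of calling eval. The calculate name is hypothetical, and this sketch deliberately supports only +, -, *, / on numeric literals; exponentiation is left out so a model cannot request an enormous power.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def calculate(expression: str):
    """Evaluate +, -, *, / over numeric literals; reject everything else."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)

product = calculate("9871 * 2349")  # 23186979
```

Anything that is not plain arithmetic, such as a function call or a name lookup, raises ValueError before it is ever evaluated.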

Database query

{
  "name": "run_sql",
  "description": "Run a read-only SQL query against the analytics warehouse. Tables: orders, customers, products.",
  "input_schema": {
    "type": "object",
    "properties": { "sql": { "type": "string" } },
    "required": ["sql"]
  }
}

Powerful and dangerous. Always run as a read-only role, always wrap in a timeout, and consider showing the SQL to the user before executing it.
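
The sketch below shows those guards using Python's built-in sqlite3 module as a stand-in for a real warehouse. The run_sql signature, the row cap, and the in-memory orders table are assumptions for the example. On a real database the primary defence is still a dedicated read-only role; the pragma and the progress-handler timeout are SQLite-specific.

```python
import sqlite3
import time

def run_sql(conn, sql, timeout_s=2.0, max_rows=100):
    """Run one read-only statement with a time limit and a row cap."""
    deadline = time.monotonic() + timeout_s
    # A non-zero return value from the handler aborts the running statement.
    conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
    try:
        conn.execute("PRAGMA query_only = ON")  # reject writes on this connection
        cursor = conn.execute(sql)
        return {"rows": cursor.fetchmany(max_rows)}
    except (sqlite3.Error, sqlite3.Warning) as exc:
        # Structured error so the model can see what went wrong and adjust.
        return {"error": "sql_failed", "message": str(exc)}
    finally:
        conn.set_progress_handler(None, 0)

# Tiny in-memory database for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5), (2, 20.0)")
conn.commit()

ok = run_sql(conn, "SELECT id, total FROM orders ORDER BY id")
blocked = run_sql(conn, "DELETE FROM orders")
```

Here ok carries the two rows, while blocked comes back as an error dictionary and the table is left untouched.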

Action tools

send_email, create_ticket, book_meeting. These have side effects. Treat them as gated — require a confirmation step or human approval before the call actually runs.
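
A gate like this can be a few lines in front of the dispatcher. In the sketch below, execute_with_gate, the ACTION_TOOLS set, and the run and approve callables are hypothetical; approve stands for whatever confirmation step your application uses, such as a UI prompt or a review queue.

```python
import json

# Tools with side effects that must never run without sign-off.
ACTION_TOOLS = {"send_email", "create_ticket", "book_meeting"}

def execute_with_gate(name, arguments, run, approve):
    """Run side-effecting tools only after approve(name, arguments) returns True."""
    if name in ACTION_TOOLS and not approve(name, arguments):
        return json.dumps({
            "error": "not_approved",
            "message": f"{name} was declined and did not run",
        })
    return run(name, arguments)
```

Read-only tools pass straight through, so the extra friction applies only where a mistake would be costly.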

Failure modes

Tools do not make the model correct. They give it more ways to be wrong.

Wrong tool, wrong moment. The model calls search_docs when the user just said “thanks.” Tighten the description: “Use only when the user asks a factual question about the product.”

Hallucinated arguments. Asked to look up a customer, the model invents a plausible-looking ID. Mitigation: validate IDs against your DB before acting, and have the tool return a clear error so the model can correct itself.
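
A sketch of that mitigation, with a dictionary standing in for the database. The CUSTOMERS data, the get_customer name, and the error code are invented for the example.

```python
import json

# Stand-in for a real customer table.
CUSTOMERS = {"C-1001": {"name": "Ada"}, "C-1002": {"name": "Grace"}}

def get_customer(customer_id: str) -> str:
    """Look up a customer, returning a structured error for unknown IDs."""
    record = CUSTOMERS.get(customer_id)
    if record is None:
        return json.dumps({
            "error": "customer_not_found",
            "message": f"No customer with ID {customer_id!r}. Ask the user to confirm the ID.",
        })
    return json.dumps({"customer_id": customer_id, **record})
```

The message tells the model what to do next, which makes it more likely to ask the user than to guess a second ID.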

Infinite loops. The model calls a search tool, gets nothing useful, calls it again with a slightly different query, repeats. Always cap the loop — five to ten iterations is plenty.

Tool-call sprawl. Twenty tools, all with vague descriptions, none chosen well. Keep the menu small. If you need more, group them under a dispatcher tool.

Silent schema drift. You change an argument from user_id to userId. The model, still seeing the old name in a cached system prompt or in stale few-shot examples, keeps emitting it. Version your tool definitions and treat them as part of your API surface.
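
One lightweight way to notice drift is to log a fingerprint of each tool definition next to every call. This is a content hash, not a full versioning scheme, and the tool_fingerprint helper and the two example definitions are hypothetical.

```python
import hashlib
import json

def tool_fingerprint(tool: dict) -> str:
    """Short, stable hash of a tool definition, independent of key order."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

old_tool = {"name": "get_user", "input_schema": {"properties": {"user_id": {"type": "string"}}}}
new_tool = {"name": "get_user", "input_schema": {"properties": {"userId": {"type": "string"}}}}

changed = tool_fingerprint(old_tool) != tool_fingerprint(new_tool)
```

If the fingerprint in your logs changes without a matching release note, something altered the tool surface without anyone deciding to.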

Cost. Each tool call is another model round trip. A chatty agent can burn through tokens fast. Log call counts per session and alert on outliers.
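
Per-session counting can start as simply as the sketch below. The CallBudget class and its default threshold are illustrative; in practice the counts would feed whatever metrics or alerting system you already run.

```python
from collections import Counter

class CallBudget:
    """Track tool calls per session and flag sessions that exceed a threshold."""
    def __init__(self, alert_threshold=20):
        self.alert_threshold = alert_threshold
        self.counts = Counter()

    def record(self, session_id: str) -> bool:
        """Count one tool call; return True once the session is over the threshold."""
        self.counts[session_id] += 1
        return self.counts[session_id] > self.alert_threshold
```

Calling record once per tool call inside the loop gives you both the log line and the signal to stop or escalate.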

Design tips that hold up

• One job per tool. get_user_profile is better than user_admin(action="get_profile", ...). The model picks cleaner verbs.
• Descriptions are prompts. Spend time on them. Include when to use the tool and when not to.
• Return structured errors. {"error": "city_not_found", "message": "..."} lets the model recover. A bare 500 does not.
• Keep results small. Massive tool outputs eat the context window. Summarise or paginate before returning.
• Make tools idempotent where you can. Retries are common; a tool that double-charges on retry is a bug waiting to happen.
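
The idempotency tip can be sketched with a key that the caller supplies and the tool remembers. The charge_once function, the in-memory _completed store, and the charge callable are assumptions for illustration; a real system would persist the keys somewhere durable.

```python
# Idempotency key -> stored result. In-memory only for this sketch.
_completed = {}

def charge_once(idempotency_key: str, amount_cents: int, charge) -> dict:
    """Perform the charge at most once per key; replay the stored result on retry."""
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = charge(amount_cents)
    _completed[idempotency_key] = result
    return result
```

A retry with the same key returns the original result without touching the payment provider a second time.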

Design exercise. Pick a workflow you already automate — a Slack command, a CLI script. Sketch what its tool definitions would look like for an LLM. Notice which arguments are obvious and which need careful descriptions. That gap is where most production bugs live.

Where tool use fits in the stack

Tool use is one of three big levers for grounding an LLM:

• Prompting: instructions and examples. See Prompt Engineering Basics.
• Retrieval: give the model relevant context. See What Is RAG?
• Tools: let the model act.

Most real applications use all three. A support agent retrieves docs (RAG), follows a system prompt that defines its persona (prompting), and can escalate a ticket or check an order status (tools).

The skill is knowing which lever to pull. If the answer lives in a document, retrieval is cheaper than a tool. If it requires fresh data or a side effect, you need a tool.

Recap

• A tool is a name, a description, and a JSON Schema for its arguments
• The model emits structured tool calls; your code executes them and feeds results back
• The interaction is a loop that ends when the model produces plain text
• Common patterns: search, calculator, DB query, action tools
• Failure modes: wrong tool, hallucinated args, loops, sprawl, silent drift, cost
• Tool descriptions are prompts, so write them carefully

Next steps

Tools let the model act. Evaluations tell you whether that action is any good — see how to actually measure LLM quality next.

→ Next: LLM Evaluation: Measuring What Actually Matters

Questions or feedback? Email codeloomdevv@gmail.com.