Function Calling with LLMs: Production Patterns

Intermediate 10 min read

What you'll learn

✓What function calling actually is at the model level
✓How to design tool schemas that survive ambiguity
✓When to use parallel vs sequential tool calls
✓How to validate and gate side-effects
✓How to debug failed tool chains

Prerequisites

•Familiar with how APIs work
•Basic LLM usage

What and Why

Function calling (sometimes called tool use) lets an LLM emit a structured request to invoke an external function instead of producing free-form text. You declare a set of tools with names, descriptions, and JSON-schema parameters. The model decides when to call one, with what arguments. Your code executes the call, returns the result, and the model continues.

This is the foundation of agents, plugin systems, retrieval pipelines, and any application where the LLM needs to talk to real systems. It is also the place where naive implementations break the most spectacularly in production.

Mental Model

The model is not “calling” anything. It is generating text constrained to a structured format. Behind the scenes, providers use techniques like constrained decoding, fine-tuning on tool-use traces, and grammar-based sampling so the output matches your schema. Your application receives a tool call object, executes the side effect, and feeds the result back.

turn 1 -> model: "I should call get_weather(city='Paris')"
       -> app: executes, returns 18C, sunny
turn 2 -> model uses result to answer the user

There is no callback, no RPC, no daemon. Function calling is a protocol for structured outputs that the model has been trained to emit reliably.

Architecture

User question
   |
   v
+--------+    tool_calls    +-----------+
|  LLM   |----------------->| Validator |
+--------+                  +-----------+
   ^                            |
   |    tool_result             v
   |                       +---------+
   +-----------------------|  Tools  |
                           | (APIs)  |
                           +---------+

Loop: model -> tool calls -> results -> model -> ...
Terminates when model produces a final assistant message.

Function-calling loop

A production loop has several layers:

Schema definition: JSON Schema for parameters. Strong typing and tight enums prevent hallucinated values.
Permission gate: not every tool should be callable in every context. Some require user approval for destructive actions.
Validator: validate arguments before execution. Reject and feed the error back to the model so it can correct itself.
Executor: the actual function. Should be idempotent or guarded by an idempotency key.
Result formatter: trim, summarize, or truncate results before feeding back to the model. A 200KB API response will blow your context budget.

Trade-offs

Parallel vs sequential tool calls. Modern models can emit multiple tool calls in one turn. Parallel is faster and cheaper when calls are independent (get_weather and get_calendar). Sequential is required when one call’s result feeds the next. Let the model decide; provide tools that compose naturally.

Strict vs loose schemas. Strict JSON schema with additionalProperties: false catches hallucinated fields. It also reduces the model’s flexibility when arguments are genuinely optional. Default to strict; relax only where you need to.

Tool count. Beyond about 20 tools, models start mis-routing. The description text for every tool sits in the prompt; the model has to choose. If you have a tool zoo, gate them: first call a router tool that returns relevant tools for the user’s intent, then issue the real call.

Hallucinated tools. Without proper provider support, models will sometimes invent tool names. Always treat the tool name as untrusted; reject unknown names with a clear error so the model can recover.

Error feedback loops. When a tool fails, return a structured error to the model: {"error": "validation_failed", "field": "date", "reason": "must be ISO-8601"}. The model can correct and retry. Don’t return raw stack traces.

Cost of retries. Each tool call round trip is another model invocation. A poorly-designed loop can cost ten calls for a task that should take two. Cap iterations and surface partial results when the cap is hit.

Side effects. The model can hallucinate confidence. It will sometimes call delete_account when the user asked something innocuous. Destructive tools should require explicit user confirmation outside the model’s loop.

Practical Tips

Name tools like functions, describe them like docs. search_orders(user_id, status) with a one-sentence description (“Find orders for a user. Use this when the user asks about their order history.”). Models route on description, not name.
Use enums where possible. status: "pending"|"shipped"|"cancelled" beats status: string. Models stick to enums reliably.
Truncate tool outputs. Return at most a few KB of text. Summarize, paginate, or return IDs the model can fetch in a follow-up.
Make tools idempotent. The model may retry. The retry should not double-charge, double-send, or double-delete.
Validate before executing. Cheap rejection prevents expensive damage. JSON Schema validation in your code, not just provider-side.
Surface tool calls in logs. Every tool call should be a structured log line with arguments, latency, and outcome. This is your single most useful debugging artifact.
Cap loop depth. Set a max-turns limit (8 is reasonable for most tasks). If the model is still looping, something is wrong; return what you have.
Stream the final answer, not the tool calls. Tool calls should complete server-side; only the assistant’s final response should stream to the user. Otherwise you leak internal reasoning.

{
  "name": "search_orders",
  "description": "Find orders for the current user. Filter by status.",
  "parameters": {
    "type": "object",
    "properties": {
      "status": {"type": "string", "enum": ["pending", "shipped", "cancelled"]},
      "limit":  {"type": "integer", "minimum": 1, "maximum": 50, "default": 10}
    },
    "required": ["status"],
    "additionalProperties": false
  }
}

Wrap-up

Function calling turns LLMs from text generators into orchestrators. The model decides what to do; your code decides what is allowed. That separation is the safety boundary, and it is where you should invest engineering effort. Good schemas, strict validation, idempotent tools, structured error feedback, and bounded loops give you a system you can reason about and debug. Skip those, and you have a chatbot that occasionally deletes things in your database. Pick the boring version.