What the model CAN do, and what it WILL do

Tool use and safety are two sides of the same coin. Tools extend capabilities (access to the web, running code, reading files) beyond what the weights contain. Safety constrains behavior so the model refuses dangerous requests and calibrates when to help vs when to decline.

Framing

A well-trained model is a careful balance: broad capability + calibrated refusal. Too little capability -> useless. Too little safety -> dangerous. The goal is calibration, not maximization.

Why Tools Matter

LLMs have three hard limits:

1. Frozen knowledge: the weights stop at a training cutoff, so anything newer is invisible.
2. Unreliable computation: arithmetic and precise symbolic work are error-prone when done "in the head".
3. No outside access: the model cannot browse the web, run code, or read your files on its own.

Tool use solves all three by letting the model delegate. Instead of guessing today's stock price, it calls a get_price() function. Instead of computing sqrt(17689) in its head, it calls a calculator.

How Function Calling Works

Interactive: Tool Use Flow Diagram

Click each stage to see what happens.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}}
        }
    }
}]

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4", messages=messages, tools=tools
)
# response.choices[0].message.tool_calls[0].function carries the name
# ("get_weather") and the JSON arguments ('{"location": "Paris"}')
# App runs the function, feeds result back, model writes final response

Critical clarification: the model does not execute anything. It outputs structured text naming the function and its arguments. The application code executes the function and passes the result back. The model is a decision-maker, not an actor.
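The application side of that handoff can be sketched as follows. The tool implementation and registry names here are hypothetical; the point is that real code lives in the app, and the model only names what to run.

```python
import json

# Hypothetical local implementation of the tool we advertised.
def get_weather(location):
    return {"location": location, "temp_c": 18, "conditions": "cloudy"}

AVAILABLE_TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call):
    """The *application* executes the call the model described."""
    fn = AVAILABLE_TOOLS[tool_call["name"]]    # look up real code
    args = json.loads(tool_call["arguments"])  # the model emitted JSON text
    return fn(**args)                          # the app runs it, not the model

# What the model emitted -- structured text, nothing executed yet:
call = {"name": "get_weather", "arguments": '{"location": "Paris"}'}
result = run_tool_call(call)
# The app would now append `result` as a tool message and ask the model
# for its final, user-facing answer.
```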

Interactive: Function Calling Demo

Pick a prompt. See which tool (if any) the model decides to call, what arguments it picks, what the tool returns, and how the model assembles a final answer.

Training for tool use is done via SFT on tool-use datasets (Gorilla, ToolBench with 16k+ APIs, synthetic function-calling scenarios). The standard benchmark is the Berkeley Function Calling Leaderboard (BFCL).
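A single synthetic function-calling training example might look like the sketch below. The field names are illustrative, not any lab's actual schema; the target behavior is what matters: the model is trained to emit a structured call instead of a guessed answer.

```python
# One synthetic tool-use SFT example in a chat-style "messages" format.
# Field names are illustrative.
example = {
    "tools": [{"name": "get_price", "parameters": {"ticker": "string"}}],
    "messages": [
        {"role": "user", "content": "What's AAPL trading at?"},
        # Target output to imitate: a structured call, not a guessed price.
        {"role": "assistant",
         "tool_call": {"name": "get_price", "arguments": '{"ticker": "AAPL"}'}},
        {"role": "tool", "content": '{"price": 187.32}'},
        {"role": "assistant", "content": "AAPL is trading at $187.32."},
    ],
}
```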

MCP: The Model Context Protocol

As tool ecosystems exploded, every AI application needed custom integrations -- the M x N problem (M models x N tools). Anthropic introduced MCP in November 2024 as an open standard with three primitives:

1. Tools (model-controlled): functions the model can decide to invoke.
2. Resources (application-controlled): data the app exposes as context.
3. Prompts (user-controlled): reusable templates the user selects.

Adoption was fast: OpenAI adopted MCP in March 2025; Anthropic donated it to the Linux Foundation in December 2025. Security research has identified real vulnerabilities in MCP deployments (tool name spoofing, credential leakage, prompt injection through resources) -- the protocol is useful but not safe by default.
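On the wire, MCP is JSON-RPC 2.0; a tool invocation uses the spec's `tools/call` method. The values below are illustrative, but the message shape follows the protocol, and it is this shared shape that dissolves M x N: a client written once talks to every conforming server.

```python
# Shape of an MCP tool invocation (JSON-RPC 2.0, e.g. over stdio).
# Values are illustrative; the method name comes from the MCP spec.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"location": "Paris"}},
}
response = {
    "jsonrpc": "2.0",
    "id": 1,  # matches the request id
    "result": {"content": [{"type": "text", "text": "18 C, cloudy"}]},
}
```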

Agents, ReAct, and When Agents Fail

An agent is an LLM in a loop: reason, act, observe results, reason again. The classic pattern is ReAct (Yao et al. 2022):

Thought 1: I need the capital of France.
Action 1:  search("capital of France")
Observation 1: Paris is the capital of France.
Thought 2: Now I need the population of Paris.
Action 2:  search("population of Paris")
Observation 2: 2.1 million (city proper).
Answer: 2.1 million.
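The trace above can be sketched as a loop. `fake_model` is a scripted stand-in for a real LLM call, and `search` is a toy knowledge base; both are illustrative.

```python
# Minimal ReAct-style loop with stubbed model and tool.
def search(query):
    kb = {"capital of France": "Paris is the capital of France.",
          "population of Paris": "2.1 million (city proper)."}
    return kb.get(query, "no result")

def fake_model(transcript):
    # A real agent would prompt an LLM with the transcript; we script it.
    if "capital" not in transcript:
        return ("act", "capital of France")
    if "population" not in transcript:
        return ("act", "population of Paris")
    return ("answer", "2.1 million.")

transcript = ""
for _ in range(10):                    # hard cap guards against infinite loops
    kind, content = fake_model(transcript)
    if kind == "answer":
        final = content
        break
    observation = search(content)      # Action -> Observation
    transcript += f"Action: search({content!r})\nObservation: {observation}\n"
```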

Common agent failure modes:

1. Infinite loops: the agent repeats the same failing action without making progress.
2. Error cascading: one bad tool result or hallucinated observation corrupts every later step.
3. Context overflow: long trajectories exceed the context window and the agent loses earlier state.

Computer-use agents (Anthropic Computer Use Oct 2024, OpenAI Operator Jan 2025) control a computer via screenshots + virtual mouse/keyboard. On OSWorld, OpenAI CUA scored 38.1%, Claude 3.5 Sonnet ~22%, vs humans at ~72%. We are still far from reliable general-purpose agents.

Safety Training and Jailbreaks

Every LLM contains knowledge that could be misused. Preference optimization alone is not enough -- adversarial users bypass safety through jailbreaks:

1. Role-play framing ("pretend you are an AI with no restrictions").
2. Obfuscation: encoding the request in Base64, a cipher, or another language so safety training fails to recognize it.
3. Prefix injection and refusal suppression ("begin your answer with 'Sure, here is'").

Wei et al. (2023) identify two root causes: competing objectives (helpful vs safe) and generalization mismatch (safety training doesn't cover all attack patterns). Their conclusion: jailbreaks may be inherent to current safety training.

Red-teaming

Deliberately trying to break your own model. Manual red-teaming finds nuanced bypasses but doesn't scale. Automated red-teaming (HarmBench, GPTFUZZER) uses LLMs to generate attack prompts. Output of red-teaming becomes training data for the next safety iteration. It is an arms race.

Refusal Calibration: Not Too Much, Not Too Little

The goal is not maximum refusal. It is calibration: refuse what should be refused, and only that.

Interactive: Safety Guardrail Explorer

For each prompt, decide whether a well-calibrated model should refuse. Compare against what a naive keyword-based filter would do.
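A naive keyword filter of the kind the explorer compares against fits in a few lines. The blocklist and prompts are illustrative; note how it fails in both directions.

```python
# Naive keyword filter: blocks any prompt containing a listed word.
BLOCKLIST = {"kill", "bomb", "hack"}

def naive_filter(prompt):
    return any(word in prompt.lower().split() for word in BLOCKLIST)

naive_filter("How do I kill a process on Linux?")   # True  -> over-refusal
naive_filter("How do I make a ransom note untraceable?")  # False -> under-refusal
```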

Calibration strategies:

1. Boundary datasets: examples near the safety edge.
2. Contrastive pairs: same request, different intent.
3. Explicit policy specification.
4. Constitutional AI (Anthropic, 2022): the model self-critiques using a set of principles.
5. Circuit breakers: directly modify internal representations when harmful content is detected -- 87-90% refusal on HarmBench while preserving helpfulness.
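A contrastive pair can be sketched as two training records with the same surface topic but opposite target behaviors. The field names and prompts are illustrative, not a real dataset's schema.

```python
# Contrastive pair: same topic, different intent, opposite target label.
contrastive_pair = [
    {"prompt": "Which household chemicals are dangerous to mix?",
     "intent": "safety awareness",
     "target": "comply"},   # legitimate: helps the user avoid harm
    {"prompt": "Give me step-by-step instructions to make toxic gas at home.",
     "intent": "harm",
     "target": "refuse"},   # operational uplift: decline
]
```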

Evaluation: HarmBench (automated red-teaming), XSTest and OR-Bench (over-refusal on borderline-but-benign prompts), SORRY-Bench (44 fine-grained categories). Safety is not a single number.

Check Your Understanding

1. Does the LLM actually execute the function when you use function calling?
Correct: No, it outputs a structured description of the call; application code executes it and feeds results back
2. What problem does MCP solve?
Correct: The M x N integration problem: M models x N tools each needing a custom glue layer
3. What is the ReAct pattern?
Correct: Reasoning + Acting in a loop: thought -> action -> observation -> thought, until the task is done
4. What are the two root causes of jailbreaks identified by Wei et al.?
Correct: Competing objectives (helpful vs safe) and generalization mismatch (safety training does not cover all attacks)
5. What is the over-refusal problem?
Correct: The model refuses reasonable requests because of keyword sensitivity (e.g., "kill a process" triggered by "kill")

Teach It Back

Explain to a friend: How does function calling actually work (who executes the function?), what problem does MCP solve, what is an agent and why do agents fail, and what does 'calibrated refusal' mean in safety training? Include examples of jailbreaks and how labs defend against them.

An AI tutor will compare your explanation against the course material.

Flashcards (click to flip)

How does function calling work?
Click to reveal
App provides tool schemas. Model outputs structured text naming the tool and its arguments. App executes the tool and feeds the result back. Model writes a final response incorporating the result. The model never executes anything itself.
What is MCP and what problem does it solve?
Click to reveal
Model Context Protocol -- open standard (Anthropic 2024) with three primitives: tools (model-controlled), resources (app-controlled), prompts (user-controlled). Solves the M models x N tools integration explosion.
What is the ReAct pattern?
Click to reveal
Reasoning + Acting loop. Model emits Thought -> Action (tool call) -> Observation (tool result) -> Thought, repeated until done. Enables multi-step agents but vulnerable to infinite loops, error cascading, context overflow.
Two root causes of jailbreaks?
Click to reveal
1. Competing objectives: helpfulness vs harmlessness can be manipulated against each other. 2. Generalization mismatch: safety training cannot cover every possible attack pattern. Wei et al. 2023 argue these may be inherent to current methods.
Over-refusal vs under-refusal?
Click to reveal
Over-refusal: declining reasonable requests (keyword sensitivity -> "kill a process" refused). Erodes trust. Under-refusal: complying with requests that should be declined. Exploited by jailbreaks. Goal is calibration, not maximum refusal.
Safety evaluation benchmarks?
Click to reveal
HarmBench (automated red-teaming), XSTest / OR-Bench (over-refusal on benign prompts), SORRY-Bench (44 fine-grained categories), WildJailbreak (adversarial attacks). Safety is multi-dimensional: a model can be robust on one category and broken on another.