Agentic AI · 11 min read
Measuring Tool Use Correctness & Plan Adherence
Deep dive into measuring tool use correctness & plan adherence.

Let me paint a picture of an outage. It is three in the morning. An on-call engineer gets paged because a background worker loop is crashing. The logs show a simple text parsing error from an internal API. Someone shipped a bug. But when the engineer digs into the source of the crash, they realize a human did not write the request payload. An automated remediation agent generated it.
The agent recognized a memory leak in a staging cluster. It correctly deduced the need to restart the pods. It correctly chose the restart_kubernetes_deployment tool. But when it built the JSON payload, it hallucinated a field type. Instead of passing an integer for the timeout value, it passed a string: "three hundred seconds". The Go backend expected an int64 and immediately panicked. The agent received the panic stack trace, assumed the cluster was unreachable, and escalated to paging the human team.
This is the operational reality of building agentic systems. When you build chatbots, your primary metric is the human vibe check. Is the tone right? Did it summarize the document? But when you attach tools to an LLM, the entire game changes. You are no longer generating text. You are generating API payloads. An agent is fundamentally only as good as the correctness of those payloads.
If you cannot measure payload correctness, you do not actually have an autonomous agent. You just have a random string generator with production credentials.
The Two Dimensions of Failure
Most teams start monitoring their agents by logging the top-level task completion rate. If the user asks the agent to provision a Google Cloud Storage bucket and the bucket appears, the task is marked successful. This binary approach works during prototyping, but it completely breaks down in production.
A successful task completion hides exactly what happened during execution. An agent might have tried to call the Google Cloud API four times before finally guessing the correct resource format. If you execute complex tool chains, you are exposed to two entirely distinct failure domains that require separate metrics.
First, we have Schema Correctness. This simply answers the question of whether the agent built a payload that perfectly matches the required JSON Schema or OpenAPI specification. Did it include all mandatory fields? Did it respect the enums? Did it avoid making up imaginary parameters? A failure here means the tool cannot even be invoked. The backend rejects the request immediately.
Second, we have Logical Correctness. This is far more dangerous. The agent correctly constructs valid JSON. It respects the schema. It passes the backend validation checks. But the data it puts inside that valid JSON is fundamentally wrong for the business context. It provisions an n1-standard-4 machine instead of an A100 GPU instance because it failed to reason about the user request. Or worse, it passes default as the target namespace instead of production-api, causing changes in an unintended environment.
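A minimal Pydantic sketch of the two failure domains, using a hypothetical RestartDeploymentArgs schema modeled on the opening outage (the schema and field names here are illustrative, not a real API):

```python
from enum import Enum
from pydantic import BaseModel, ValidationError

class Namespace(str, Enum):
    STAGING = "staging"
    PRODUCTION_API = "production-api"

class RestartDeploymentArgs(BaseModel):
    deployment: str
    namespace: Namespace
    timeout_seconds: int  # the Go backend expects an int64

# Schema failure: the hallucinated string from the opening outage
# is rejected at the boundary before the backend ever sees it.
try:
    RestartDeploymentArgs(
        deployment="background-worker",
        namespace="staging",
        timeout_seconds="three hundred seconds",
    )
except ValidationError:
    pass  # caught here: no panic, no 3 a.m. page

# Logical failure: structurally valid, but if the user asked for
# production-api, restarting staging is still the wrong action.
payload = RestartDeploymentArgs(
    deployment="background-worker",
    namespace="staging",
    timeout_seconds=300,
)
```

Note that no validator can catch the second case; only telemetry and trajectory review will surface it.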
We are going to focus heavily on instrumenting the first domain, because if the schema validation fails, the logical execution never even begins.
Building the Observation Deck
You cannot fix what you cannot query. If you want to stop your LLM agents from hallucinating JSON, you need to treat tool invocations exactly like microservice-to-microservice Remote Procedure Calls. You need distributed tracing, structured logs, and detailed error tracking.
We are going to construct a middleware layer that sits between the LLM output and the actual function execution. This layer has a single responsibility. It intercepts the raw JSON string proposed by the model, evaluates it against the Pydantic schema, records the results as structured telemetry, and then either executes the tool or returns the validation error back to the model for self-correction.
Let us look at how this plays out in code. We will use standard Python constructs to wrap our tools.
import json
import time
import uuid
from typing import Any, Callable, Dict

from pydantic import BaseModel, ValidationError
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("agent-tool-execution")

def generate_trace_id() -> str:
    # Unique identifier for correlating every attempt in the logs
    return uuid.uuid4().hex

def instrument_tool(tool_name: str, schema: type[BaseModel]) -> Callable:
    def decorator(func: Callable) -> Callable:
        def wrapper(raw_llm_payload: str) -> Dict[str, Any]:
            start_time = time.time()
            execution_id = generate_trace_id()
            log_entry = {
                "execution_id": execution_id,
                "tool_name": tool_name,
                "raw_payload": raw_llm_payload,
                "status": "pending",
            }
            try:
                # Attempt to parse the raw string into standard JSON
                parsed_json = json.loads(raw_llm_payload)
            except json.JSONDecodeError as e:
                log_entry["status"] = "json_parse_error"
                log_entry["error_detail"] = str(e)
                logger.log_struct(log_entry, severity="WARNING")
                return {"error": f"Invalid JSON format. Fix this error: {e}"}
            try:
                # Validate the parsed JSON against the Pydantic schema
                validated_payload = schema(**parsed_json)
            except ValidationError as e:
                log_entry["status"] = "schema_validation_error"
                log_entry["error_detail"] = e.errors()
                logger.log_struct(log_entry, severity="WARNING")
                # Return exact validation errors to the LLM for self-correction
                return {"error": "Schema violation", "details": e.errors()}
            # If we reach here, the payload is structurally perfect
            try:
                log_entry["status"] = "executing"
                result = func(**validated_payload.model_dump())
                log_entry["status"] = "success"
                log_entry["duration_ms"] = (time.time() - start_time) * 1000
                logger.log_struct(log_entry, severity="INFO")
                return result
            except Exception as e:
                # The tool failed logically or due to an external API error
                log_entry["status"] = "execution_failure"
                log_entry["error_detail"] = str(e)
                logger.log_struct(log_entry, severity="ERROR")
                return {"error": "Tool execution failed", "details": str(e)}
        return wrapper
    return decorator

This snippet does something critical. It entirely decouples the evaluation of the payload from the execution of the business logic. It also normalizes our logging structure into Cloud Logging. Every single attempt an agent makes to use a tool is now indexed and searchable. We are no longer guessing why an agent chose a specific path.
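Because the wrapper returns validation errors to the caller instead of raising, a thin driver loop can feed those errors straight back to the model. A minimal sketch of that self-correction loop, where propose_payload is a hypothetical stand-in for the actual LLM call:

```python
from typing import Any, Callable, Dict

def run_with_self_correction(
    tool: Callable[[str], Dict[str, Any]],
    propose_payload: Callable[[str, Any], str],
    task: str,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Call an instrumented tool, feeding validation errors back to the
    model until the payload passes or the attempt budget runs out."""
    feedback = None
    for _ in range(max_attempts):
        raw = propose_payload(task, feedback)   # model proposes a JSON string
        result = tool(raw)
        if not (isinstance(result, dict) and "error" in result):
            return result                       # clean execution
        feedback = result                       # exact errors go back to the model
    raise RuntimeError(f"payload still invalid after {max_attempts} attempts")
```

The attempt budget matters: without it, a model stuck on the same malformed field will loop indefinitely against your backend.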
Deriving the Golden Metrics
Once this structured telemetry begins flowing into Google Cloud Logging, you can easily route those logs via a sink directly into BigQuery. Now you have a relational database filled with every decision your AI agent ever made. You can start running SQL queries to measure system health.
What exactly should you be querying? I recommend tracking three specific metrics to determine the maturity of your agent architecture.
The first metric is Initial Schema Correctness Rate. This measures the percentage of tool invocations that passed schema validation on the very first attempt. If you notice a specific tool has a 40% initial correctness rate, you should not try to fix the LLM prompt. You need to fix the tool schema. The interface is too complex. If you have deeply nested JSON objects, flatten them. If you expect a string that must perfectly match a specific cloud region, change it to a strict Pydantic Enum. Design your tool signatures with empathy for the model. Make the easiest path the correct path.
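As a sketch of what that BigQuery query computes, here is the same metric in plain Python over exported log entries. It assumes each entry carries a trace_id and timestamp in addition to the fields logged earlier, which is an extension of the telemetry shown above:

```python
from collections import defaultdict

def initial_schema_correctness(entries: list[dict]) -> dict[str, float]:
    """First-attempt pass rate per tool. The first invocation of a tool
    within a trace counts as the initial attempt."""
    firsts: dict[tuple, str] = {}
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        key = (entry["trace_id"], entry["tool_name"])
        firsts.setdefault(key, entry["status"])  # keep only the first attempt
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for (_, tool), status in firsts.items():
        total[tool] += 1
        if status not in ("json_parse_error", "schema_validation_error"):
            passed[tool] += 1
    return {tool: passed[tool] / total[tool] for tool in total}
```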
The second metric is Mean Attempts to Recovery. When an agent fails schema validation and receives the error back, how many subsequent attempts does it take to finally format the JSON correctly? Advanced models are excellent at zeroing in on Pydantic validation errors, often correcting them instantly on the second try. If your Mean Attempts to Recovery is higher than two, your error messages are likely not descriptive enough. You are returning “Invalid Input” instead of telling the model exactly which field was missing. Provide the model with exact contextual feedback.
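A companion sketch for this metric, under the same assumption that exported entries carry a trace_id and timestamp: group each trace's attempts per tool and count how many retries it takes to produce a structurally valid payload.

```python
from collections import defaultdict

def mean_attempts_to_recovery(entries: list[dict]) -> float:
    """For every (trace, tool) run that starts with a structural failure,
    count the extra attempts until the first valid payload."""
    runs: dict[tuple, list[str]] = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        runs[(entry["trace_id"], entry["tool_name"])].append(entry["status"])
    failures = ("json_parse_error", "schema_validation_error")
    recoveries = []
    for statuses in runs.values():
        if statuses[0] not in failures:
            continue  # passed on the first try; nothing to recover from
        for i, status in enumerate(statuses[1:], start=1):
            if status not in failures:
                recoveries.append(i)  # attempts needed after the first failure
                break
    return sum(recoveries) / len(recoveries) if recoveries else 0.0
```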
The final metric targets Logical Error Rates. These are the executions that passed validation but crashed the underlying function. The payload was a valid string, but the Google Cloud Storage bucket name contained uppercase letters. This indicates a gap in your schema descriptions. Pydantic allows you to inject descriptions into the JSON Schema. If a bucket name must be lowercase, write that rule directly into the field description so the model reads the constraint before generating the payload.
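A sketch of that pattern with a hypothetical CreateBucketArgs tool schema. The pattern below is a simplified version of the bucket naming rule, and the description is emitted into the JSON Schema the model reads before generating a payload:

```python
from pydantic import BaseModel, Field, ValidationError

class CreateBucketArgs(BaseModel):
    # The constraint lives in the schema itself: the model sees the
    # description before generating, and validation enforces it after.
    bucket_name: str = Field(
        pattern=r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$",
        description=(
            "Globally unique bucket name. Lowercase letters, digits, "
            "hyphens, underscores, and dots only. 3-63 characters. "
            "Must start and end with a letter or digit."
        ),
    )

# The description travels with the generated JSON Schema:
schema = CreateBucketArgs.model_json_schema()
assert "Lowercase" in schema["properties"]["bucket_name"]["description"]

try:
    CreateBucketArgs(bucket_name="MyBucket")  # uppercase: rejected
except ValidationError:
    pass
```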
Plan Adherence and Tracing Trajectories
Getting individual tools to execute flawlessly is barely half the battle. Agents are designed to handle multi-step objectives. They formulate a plan, select a tool, observe the output, and decide on the next tool execution. This sequence is known as a trajectory.
A common failure mode in complex tasks is distraction. The agent successfully queries a database, but the result contains an interesting but entirely irrelevant anomaly. The agent forgets the initial user request, pivots its entire objective, and begins investigating the anomaly. It executes five more tools, summarizes the anomaly beautifully, and completely fails to complete the original task.
This lack of Plan Adherence is incredibly difficult to spot if you are only monitoring schema correctness. The tools executed perfectly. The payloads were valid. The agent simply walked off the job site and started working on a different building.
To measure Plan Adherence, your telemetry needs correlation identifiers. Notice how our earlier Python snippet generated a unique execution ID for the tool. That is not enough. You must generate a global “Trace ID” the moment the user submits their initial request. You must inject that Trace ID into the LLM context, and every subsequent tool invocation must be tagged with it.
By grouping tool invocations by Trace ID in BigQuery, you reconstruct the exact chain of thought. You can visualize this trajectory. You can see the agent call list_instances, then get_instance_details, and then inexplicably call delete_instance.
Evaluating these trajectories requires standardizing a “Golden Path” for common workflows. If the objective is “deploy this container to Vertex AI”, the Golden Path dictates a specific sequence. Check artifact registry. Verify image exists. Create Vertex Model resource. Create Vertex Endpoint. Deploy Model to Endpoint. While minor deviations are expected (perhaps the agent needs to retry a network timeout), significant deviations indicate a complete breakdown in reasoning.
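One lightweight way to quantify deviation is to diff the observed tool sequence against the Golden Path. The sketch below uses hypothetical tool names and difflib as the comparator, a deliberately simple stand-in that collapses adjacent retries so benign network-timeout loops are not penalized:

```python
from difflib import SequenceMatcher

GOLDEN_PATH = [
    "check_artifact_registry",
    "verify_image_exists",
    "create_vertex_model",
    "create_vertex_endpoint",
    "deploy_model_to_endpoint",
]

def adherence_score(actual: list[str], golden: list[str] = GOLDEN_PATH) -> float:
    """Score a tool trajectory against the Golden Path, 0.0 to 1.0.
    Adjacent duplicate calls are collapsed first, so retries are free."""
    deduped = [t for i, t in enumerate(actual) if i == 0 or t != actual[i - 1]]
    return SequenceMatcher(None, deduped, golden).ratio()
```

A score near 1.0 means the agent stayed on the rails; a low score flags a trace worth routing to a heavier LLM evaluator.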
You can operationalize this evaluation using Vertex AI Evaluation pipelines. You dump the completed application traces from BigQuery and feed them back into another LLM evaluator. You ask the evaluator model to score the trajectory against the defined Golden Path. Did the agent take unnecessary steps? Did it run tools in a destructive order? Did it successfully recover from expected failures without losing the overarching context?
Dealing With Context Window Decay
Another highly observable failure pattern in Plan Adherence emerges when dealing with extended trajectories. The first few tool calls might be perfect. The JSON payloads are pristine, and the logic aligns with the overarching objective. But as the agent loops through ten or fifteen steps, the context window fills with verbose tool responses, system logs, and intermediate reasoning.
This accumulation of state leads to a phenomenon I call Context Window Decay. The prompt begins to lose its strict formatting influence. The agent starts forgetting the specific JSON Schema it adhered to flawlessly just five minutes prior. Suddenly, it stops wrapping its variables in arrays or forgets crucial authentication tokens.
Your metric dashboards will show a distinct pattern. The Initial Schema Correctness Rate will look great for the first three steps of a trace but will plummet drastically on step twelve.
You cannot solve Context Window Decay by simply upgrading to a larger context window model. A long-context model like Gemini 2.5 Pro gives you the capacity to store the history, but giving a model more noise often degrades its precision. You need to prune the state tree.
Implementation requires building sliding windows or summarization checkpoints into your agent loop. When a tool returns a massive JSON payload from a cloud API, do not append the raw response to the agent’s memory. Intercept the response, compress it to the strictly necessary fields, and append only the summary. Or better yet, instruct the agent to write intermediate state to a persistent memory store, like a Redis cache or a temporary Google Cloud Storage bucket, and only keep the reference URI in the active conversational context.
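A minimal sketch of the compression step, with a plain dict standing in for the Redis cache or Google Cloud Storage bucket and an invented mem:// URI scheme for the reference:

```python
import uuid

def compress_tool_response(
    response: dict, keep: list[str], store: dict
) -> dict:
    """Keep only the fields the plan actually needs; park the full
    response in an external store and hand the agent a reference URI
    it can dereference later instead of the raw payload."""
    ref = f"mem://tool-responses/{uuid.uuid4().hex}"
    store[ref] = response                       # full payload, out of context
    summary = {k: response[k] for k in keep if k in response}
    summary["full_response_ref"] = ref          # pointer, not payload
    return summary
```

Only the summary is appended to the agent's conversational memory; the trajectory stays small while nothing is actually lost.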
By strictly limiting the token count of the active trajectory, you preserve the weight of the system instructions at the top of the prompt. The model remains sharply focused on the schema definitions, significantly stabilizing long-running operations.
Shaping the Production Environment
We spend so much time discussing prompt engineering and model parameter sizing, but we ignore the actual bedrock of reliability. Software engineering disciplines do not vanish just because the primary actor is a neural network.
If an API payload is incorrect, the system goes down. It does not matter if a junior developer typed the payload or if a trillion-parameter model generated it. The backend server parsing the request lacks the capacity to care about the origin of the bytes. It only cares about the schema.
The shift toward agentic design requires practitioners to stop treating the LLM as a magical oracle and start treating it as another microservice in a highly asynchronous architecture. You wrap it in retries. You validate its outputs at the boundary exactly as you would validate untrusted user input. You pipe every decision downstream into an immutable ledger for operational auditing.
Building autonomous systems on Google Cloud is less about tuning the reasoning engine itself and more about heavily instrumenting the scaffolding around it. When your agents fail (and they will definitely fail), your metric dashboards must immediately illuminate the structural reason. If you cannot look at a Cloud Logging dashboard and instantly tell whether an agent hallucinated a payload or merely stumbled over a transient network timeout, you have more engineering to do. Give the model clear boundaries, validate its outputs mercilessly, and measure every single step of its trajectory.



