Strategy · 8 min read
Beyond MMLU: The Shift to "Tool Correctness" Metrics
Why standard LLM benchmarks fail for agents, and how to measure real tool usage in production.

Let’s admit it: MMLU scores are the vanity metrics of the AI era. They are the digital equivalent of bragging about your SAT scores in your thirties. They tell you if a model is smart on paper. They do not tell you if it can execute a business workflow without setting your production environment on fire.
When we shift from prompt-and-response chatbots to autonomous agents, the evaluation criteria change entirely. You are no longer grading prose. You are grading execution. An agent is a system that writes its own code (or its own API payloads) to solve a problem.
If you cannot measure whether those self-generated payloads are correct, you do not have an autonomous agent. You just have a random string generator with production access.
The Mirage of “General” Benchmarks
Standard benchmarks like MMLU (Massive Multitask Language Understanding) evaluate a model’s ability to answer multiple-choice questions across a broad range of subjects. They test whether the model knows about biology, history, and basic computer science.
This is useful for foundational models. If a model fails MMLU, it probably lacks the basic reasoning capabilities to be a good agent. But if it passes, it does not mean it will succeed in your enterprise.
An enterprise agent is not answering trivia. It is querying internal databases, provisioning cloud resources, or updating CRM records. It is interacting with external APIs.
The failure mode for an agent is rarely that it fails to understand the concept of a “database.” The failure mode is that it forgets to wrap a string in quotes, or it hallucinates an extra parameter in a REST payload that causes a 400 Bad Request. Or worse, it passes a valid integer that happens to represent the wrong customer ID.
To evaluate agents, we need to move past “General” benchmarks and adopt “Tool Correctness” metrics.
Defining Tool Correctness
Tool Correctness is the measure of how accurately an agent uses external functions to achieve an objective. It breaks down into three distinct layers of verification.
1. Schema Correctness
This is the most basic layer. It answers a simple question: Did the agent construct a payload that matches the required schema?
If your tool expects parameters like customer_id (integer) and action (enum), and the agent passes customerId (string) and an open-ended text description, that is a schema failure. The backend will reject the request immediately. The execution never even begins.
Schema failures are annoying, but they are relatively safe. They trigger errors early, and modern models are excellent at self-correcting if you feed the validation error back to them.
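To make this concrete, here is a minimal sketch of a schema check for the example above. The action values are illustrative, and the check is hand-rolled for clarity; in practice a library like Zod would do this, as in the interceptor later in this article.

```typescript
// Hypothetical schema for the example: customer_id (integer) and action (enum).
const ALLOWED_ACTIONS = new Set(['activate', 'deactivate', 'suspend']);

// Hand-rolled validation for illustration only.
function schemaErrors(payload: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(payload['customer_id'])) {
    errors.push('customer_id must be an integer');
  }
  const action = payload['action'];
  if (typeof action !== 'string' || !ALLOWED_ACTIONS.has(action)) {
    errors.push('action must be one of: ' + [...ALLOWED_ACTIONS].join(', '));
  }
  return errors;
}

// The agent used camelCase and free text: two schema failures, caught before execution.
const errs = schemaErrors({ customerId: '42', action: 'please deactivate this user' });
```

Because the payload is rejected at the boundary, the backend never sees it, which is exactly why this failure class is loud but cheap.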
2. Parameter Drift
Parameter drift occurs when an agent begins inventing its own schema fields over time. This is a subtle failure mode.
The agent might start execution correctly. It calls the tool with the correct parameters for the first three steps. But as the conversation history grows and the context window fills with logs and intermediate reasoning, the prompt begins to lose its strict formatting influence.
Suddenly, on step twelve, the agent decides to add an imaginary dry_run flag to the payload. It assumes the tool supports dry-run mode because it “feels” like it should. If your backend ignores excess fields, you might execute a destructive action when you thought you were testing it.
Measuring parameter drift requires tracking the delta between the proposed payload and the defined schema over time. If you notice accuracy dropping as the trace length increases, you are suffering from Context Window Decay.
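One way to track that delta, sketched below: diff each proposed payload against the tool's declared field set and log any invented keys per step. The field names are illustrative; plotting the drift count against step number is what surfaces the decay.

```typescript
// Fields the tool actually declares (illustrative).
const DECLARED_FIELDS = new Set(['instance_id', 'action']);

// Keys the agent invented that are not in the schema.
function driftedFields(payload: Record<string, unknown>): string[] {
  return Object.keys(payload).filter((k) => !DECLARED_FIELDS.has(k));
}

// Step 3: clean payload. Step 12: the agent has invented a dry_run flag.
const step3 = driftedFields({ instance_id: 'i-123', action: 'stop' });
const step12 = driftedFields({ instance_id: 'i-123', action: 'stop', dry_run: true });
```

Logging this per step, alongside the trace length, gives you the time series you need to spot decay before it reaches a destructive tool.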
3. Logical and Semantic Correctness
This is the most dangerous failure domain. The agent builds a valid JSON payload. It respects the schema. It passes all validation checks.
But the data inside the payload is wrong for the business context.
Imagine the agent is asked to “deactivate idle servers.” It correctly identifies that it should use the stop_instance tool. It builds a valid payload. But when it selects the instance_id, it grabs the ID of the primary production database instead of the staging test instance.
The tool executes perfectly. The server disappears. The system goes down. This is not a schema failure. This is a reasoning failure.
Measuring logical correctness is difficult because it requires context. You cannot verify it by simply checking a JSON schema. You must verify it against the state of the system at the time of execution.
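Schema checks cannot catch this, but a state-aware guard at the execution boundary can. A minimal sketch, assuming each instance carries an environment tag you can look up before a destructive call; the tag name, values, and guard policy here are all assumptions, not a prescribed design.

```typescript
interface InstanceRecord {
  id: string;
  environment: 'production' | 'staging' | 'dev';
}

// Guard for a hypothetical stop_instance tool: refuse destructive actions
// against production unless the user's request explicitly targeted production.
function guardStopInstance(
  target: InstanceRecord,
  requestMentionsProduction: boolean
): { allowed: boolean; reason?: string } {
  if (target.environment === 'production' && !requestMentionsProduction) {
    return { allowed: false, reason: `Refusing to stop ${target.id}: tagged production` };
  }
  return { allowed: true };
}

// The payload was schema-valid, but the guard catches the reasoning failure.
const verdict = guardStopInstance({ id: 'db-primary-01', environment: 'production' }, false);
```

The guard consults the state of the system at execution time, which is precisely what a JSON Schema cannot do.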
Shifting Your Telemetry
To measure Tool Correctness at scale, you need to treat agent tool invocations exactly like microservice-to-microservice Remote Procedure Calls. You need structured logging and distributed tracing.
You should construct a middleware interceptor that sits between the LLM output and the actual function execution. This layer has a single responsibility: it intercepts the proposed payload, evaluates it against the Pydantic schema (or TypeScript types), records the results as structured telemetry, and then either executes the tool or returns the validation error back to the model for correction.
Let us look at how this plays out in code using TypeScript and Zod.
import { z } from 'zod';

interface ToolDefinition<T extends z.ZodTypeAny> {
  name: string;
  schema: T;
  execute: (args: z.infer<T>) => Promise<any>;
}

interface ExecutionLog {
  executionId: string;
  toolName: string;
  status: 'valid' | 'schema_violation' | 'execution_failure' | 'success';
  rawPayload: any;
  errorDetail?: string;
  durationMs?: number;
}

export async function instrumentTool<T extends z.ZodTypeAny>(
  tool: ToolDefinition<T>,
  rawPayload: string,
  executionId: string
): Promise<any> {
  const startTime = Date.now();
  const log: ExecutionLog = {
    executionId,
    toolName: tool.name,
    rawPayload,
    status: 'valid',
  };

  try {
    // 1. Parse JSON
    const parsed = JSON.parse(rawPayload);

    // 2. Validate schema using Zod
    const validatedArgs = tool.schema.safeParse(parsed);
    if (!validatedArgs.success) {
      log.status = 'schema_violation';
      log.errorDetail = validatedArgs.error.message;
      await saveLogToBigQuery(log);
      return { error: 'Schema violation', details: validatedArgs.error.errors };
    }

    // 3. Execute
    const result = await tool.execute(validatedArgs.data);
    log.status = 'success';
    log.durationMs = Date.now() - startTime;
    await saveLogToBigQuery(log);
    return result;
  } catch (e: any) {
    log.status = 'execution_failure';
    log.errorDetail = e.message;
    await saveLogToBigQuery(log);
    return { error: 'Execution failed', details: e.message };
  }
}

async function saveLogToBigQuery(log: ExecutionLog) {
  // Implementation to stream telemetry to BigQuery
  console.log("[Telemetry Tracking]:", log);
}

By decoupling the tool evaluation from the execution, we build an immutable ledger of every decision our agent proposes. We normalize our telemetry. Every single attempt an agent makes to use a tool is now indexed and searchable.
Derived Metrics for Continuous Evaluation
Once this telemetry begins flowing into your analytical database, you can run queries to measure actual system health. I recommend tracking these three metrics to evaluate your agent architecture.
1. Initial Schema Correctness Rate
Percentage of tool invocations that passed schema validation on the very first attempt.
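Given the telemetry from the interceptor, this metric is a simple aggregation. A sketch, assuming each log row also carries an attempt counter, which is a field you would add to the interceptor; it is not part of the ExecutionLog shape shown earlier.

```typescript
interface AttemptLog {
  toolName: string;
  attempt: number; // 1 = first try for this invocation (assumed extra field)
  status: 'success' | 'schema_violation' | 'execution_failure';
}

// Share of first attempts that were not rejected by schema validation.
function initialSchemaCorrectnessRate(logs: AttemptLog[]): number {
  const firstTries = logs.filter((l) => l.attempt === 1);
  if (firstTries.length === 0) return 0;
  const passed = firstTries.filter((l) => l.status !== 'schema_violation').length;
  return passed / firstTries.length;
}

const rate = initialSchemaCorrectnessRate([
  { toolName: 'stop_instance', attempt: 1, status: 'success' },
  { toolName: 'stop_instance', attempt: 1, status: 'schema_violation' },
  { toolName: 'stop_instance', attempt: 2, status: 'success' },
]);
// Retries (attempt > 1) are excluded so recovery does not inflate the number.
```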
If you find a specific tool has a low initial correctness rate, the problem is likely not the LLM. The problem is your tool interface design. The interface is too complex. If you have deeply nested objects, flatten them. If you expect a string that must match a cloud region, change it to a strict enum. Design your tool signatures with empathy for the model. Make the easiest path the correct path.
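The region example might look like this: replace a free-form string with a closed set the model can only get right or visibly wrong. The region list is an illustrative subset; with Zod you would reach for z.enum instead.

```typescript
// Closed set of accepted regions (illustrative subset).
const REGIONS = ['us-central1', 'europe-west1', 'asia-east1'] as const;
type Region = (typeof REGIONS)[number];

function isRegion(value: string): value is Region {
  return (REGIONS as readonly string[]).includes(value);
}

// A free-form string field would accept both of these; the closed set rejects the second.
const ok = isRegion('us-central1');
const bad = isRegion('US East');
```

The enum also flows into the JSON Schema the model sees, so the valid values are in front of it before it generates the payload.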
2. Mean Attempts to Recovery
When an agent fails schema validation and receives the error back, how many subsequent attempts does it take to finally format the JSON correctly?
Modern frontier models are excellent at zeroing in on validation errors and correcting them instantly on the second try. If your Mean Attempts to Recovery is higher than two, your error messages are probably too generic. You are returning “Invalid Input” instead of telling the model “The field timeout expects an integer between 1 and 300, but you passed a string.” Provide the model with exact contextual feedback.
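A small helper can turn a validation failure into exactly that kind of actionable feedback. The FieldSpec shape here is an assumption for illustration; in the Zod interceptor you would derive the same message from the issues in safeParse's error.

```typescript
interface FieldSpec {
  name: string;
  expected: string; // human-readable constraint, e.g. "an integer between 1 and 300"
}

// Tell the model what the field wanted and what it actually sent.
function feedbackFor(spec: FieldSpec, received: unknown): string {
  const got = received === undefined ? 'nothing' : `a ${typeof received}`;
  return `The field ${spec.name} expects ${spec.expected}, but you passed ${got}.`;
}

const msg = feedbackFor({ name: 'timeout', expected: 'an integer between 1 and 300' }, '60');
// "The field timeout expects an integer between 1 and 300, but you passed a string."
```

Returning this string instead of “Invalid Input” is the difference between one retry and four.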
3. Logical Error Rates vs. Success Rates
These are the executions that passed schema validation but crashed the underlying function. The payload was valid syntax, but the business logic rejected it.
This indicates a gap in your schema descriptions. Modern type systems allow you to inject descriptions into the generated JSON Schema. If a file name must be lowercase alphanumeric, write that rule directly into the field description. The model will read this constraint before generating the payload.
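Sketched as a plain JSON Schema fragment below; Zod's .describe() produces the same effect when you generate the schema from your types. The file-name rule is the example from the paragraph above, and keeping the validation regex on the same object means the description and the enforcement can never drift apart.

```typescript
// JSON Schema fragment sent to the model as part of the tool definition.
const fileNameField = {
  type: 'string',
  pattern: '^[a-z0-9]+$',
  description:
    'File name. Must be lowercase alphanumeric only: no spaces, dashes, or uppercase letters.',
};

// The same rule, enforced at the execution boundary.
function isValidFileName(value: string): boolean {
  return new RegExp(fileNameField.pattern).test(value);
}
```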
Shaping the Production Environment
We spend so much time discussing prompt engineering and model parameter sizing, but we ignore the actual bedrock of reliability. Software engineering disciplines do not vanish just because the primary actor is a neural network.
If an API payload is incorrect, the system goes down. Digital services do not care about the origin of bytes. They only care about the schema.
The shift toward agentic design requires practitioners to stop treating the LLM as a magical oracle and start treating it as another microservice in a highly asynchronous architecture. You wrap it in retries. You validate its outputs at the boundary exactly as you would validate untrusted user input. You pipe every decision downstream into an immutable ledger for operational auditing.
Building autonomous systems on Google Cloud is less about tuning the reasoning engine itself and more about heavily instrumenting the scaffolding around it. When your agents fail (and they will definitely fail), your metric dashboards must immediately illuminate the structural reason. Move past MMLU. Give the model clear boundaries, validate its outputs mercilessly, and measure every single step of its tool execution loop.