· Deep Tech · 8 min read
Beyond Vibe-Checks: Trajectory Evaluation & Synthetic Adversaries
Is your agent actually reasoning, or just lucky? Discover why trajectory analysis and synthetic red-teaming are the only ways to build production-grade autonomous systems.

There is a phrase currently haunting the engineering floors of every AI-first startup: “It feels like it’s working.”
In a traditional software stack, “feeling” like it is working is not a metric. We have unit tests for that. We have integration tests. We have p99 latency dashboards. But when you move into the world of autonomous agents—systems that can plan, call tools, and self-correct—traditional testing begins to feel hopelessly blunt.
If an agent successfully books a flight after five tool calls, did it perform well? Or did it wander through three redundant search queries, fail to parse a JSON response twice, and finally stumble into a success state?
The “vibe-check”—asking a human or even another LLM if the final answer “looks good”—is the primary bottleneck to production-grade AI. To move beyond the hype, we have to start evaluating the Trajectory, not just the destination.
The Failure of Outcome-Only Evaluation
Most teams start with what I call “Terminal Evaluation.” You give an agent a prompt, you collect the final string it produces, and you compare that string to a reference answer. Perhaps you use a metric like BERTScore, or you ask Gemini 2.5 Pro to rate it on a scale of 1 to 10.
Terminal Evaluation is dangerous because it hides the Toil.
Imagine an agent designed to resolve a customer support ticket. It has access to three tools: get_user_history, get_account_status, and issue_refund. In a successful run, the agent calls get_user_history, realizes the account is flagged, and declines the refund. In a “toilsome” run, the agent calls get_account_status twice, forgets to call the history tool, tries to call a non-existent help_me tool, receives a 404, and finally, after a broad reasoning loop, correctly declines the refund.
Both runs have the same “Terminal” result. Both would pass a vibe-check. But the second run is twice as expensive, takes three times longer, and is statistically more likely to hallucinate under high load.
If you are not evaluating the path, you are not evaluating the agent.
Introducing Trajectory Evaluation
Trajectory Evaluation is the process of treating the agent’s internal reasoning chain (the “Thought,” “Action,” “Observation” loop) as a first-class data object. We are not just looking for the final answer; we are looking for the architectural efficiency of the thought process.
To implement this on GCP, we start with a structured logging layer. Every time an agent (running perhaps as a Cloud Function or on GKE) invokes an LLM, we capture the Reasoning Trace. Using OpenTelemetry, we can span the entire trajectory, marking each tool call as a child event.
Once we have the data, we apply three specific Trajectory Metrics:
- Redundancy Ratio: The number of unique tool calls divided by the total number of tool calls. If this number is low, your agent is stuck in a loop.
- Information Gain Per Step: A measure of how much new context was added to the agent’s memory in each turn. If an observation returns a null set and the agent ignores that fact, the reasoning has stalled.
- Grounding Delta: The delta between the agent’s predicted plan and its actual execution. If the agent says “I will check the database” but actually calls a web search tool, you have a planning-to-execution drift.
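The three metrics above are cheap to compute once the trace is structured. Here is a minimal sketch, assuming a simplified `Step` record (the field names are illustrative, not a real schema):

```python
from dataclasses import dataclass

@dataclass
class Step:
    planned_tool: str   # the tool the agent said it would use
    called_tool: str    # the tool it actually invoked
    new_facts: int      # count of previously unseen facts in the observation

def redundancy_ratio(steps):
    """Unique tool calls / total tool calls; a low value suggests looping."""
    calls = [s.called_tool for s in steps]
    return len(set(calls)) / len(calls) if calls else 1.0

def stalled_steps(steps):
    """Steps that added no new information to the agent's memory."""
    return sum(1 for s in steps if s.new_facts == 0)

def grounding_delta(steps):
    """Fraction of steps where execution diverged from the stated plan."""
    drift = sum(1 for s in steps if s.planned_tool != s.called_tool)
    return drift / len(steps) if steps else 0.0
```

In practice you would populate these records from your trace store rather than by hand, but the scoring logic stays this simple.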
Implementing Trajectory Extraction with OpenTelemetry
To make Trajectory Evaluation a reality, we must move beyond standard application logging. On Google Kubernetes Engine (GKE), we can leverage OpenTelemetry v1.x to create high-fidelity traces of agentic behavior.
Every step in an agent’s reasoning loop—the planning phase, the tool selection, the actual execution, and the observation parsing—should be wrapped in an OpenTelemetry span. These spans should include critical metadata in their attributes, such as:
- agent.reasoning.step_id: The current index in the thought loop.
- agent.tool.name: The external function being invoked.
- agent.token_usage: The cost of this specific reasoning step.
- agent.feedback.loop_detected: A boolean flag if the judge detects a repetitive pattern.
By exporting these traces to Cloud Trace, we can visualize the entire “thought lifecycle.” We can see the exact moment an agent began to diverge from its plan. More importantly, we can build custom dashboards that alert us when the “Redundancy Ratio” across our production clusters spikes, indicating a widespread regression in our model’s reasoning stability.
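Conceptually, each reasoning step is wrapped in a span that carries those attributes. The stdlib sketch below stands in for a real tracer (in production you would use the opentelemetry-api `start_as_current_span` and a Cloud Trace exporter; the in-memory `SPANS` list here is purely for illustration):

```python
import time
from contextlib import contextmanager

# Stand-in span store; a real deployment exports to Cloud Trace instead.
SPANS = []

@contextmanager
def agent_span(name, **attributes):
    """Record one reasoning step with its metadata and wall-clock duration."""
    span = {"name": name, "attributes": attributes, "start": time.monotonic()}
    try:
        yield span
    finally:
        span["duration_s"] = time.monotonic() - span["start"]
        SPANS.append(span)

# Wrapping a single step of the thought loop:
with agent_span(
    "agent.reasoning.step",
    **{
        "agent.reasoning.step_id": 3,
        "agent.tool.name": "get_user_history",
        "agent.token_usage": 412,
        "agent.feedback.loop_detected": False,
    },
):
    pass  # invoke the tool and parse the observation here
```

The point is that the attribute keys are stable and machine-readable, so downstream dashboards can aggregate over them without parsing free-form logs.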
Scoring the Resilience: The Judge Agent Rubric
A trajectory eval is only as good as the judge that scores it. While we want to avoid “vibe-checks,” we still need a high-level reasoning engine to analyze the traces. We use a large-context-window model for this task, precisely because it must reason over long sequences of structured logs.
The judge is fed the entire trace and a specific Scoring Rubric. This is not a fuzzy instruction; it is a rigid programmatic definition. We score the trajectory on four vectors:
- Logical Parsimony: Did the agent take the fewest number of steps necessary to achieve the objective? Points are deducted for every redundant tool call or self-correction that was not triggered by an external environment error.
- Observation Fidelity: Did the agent correctly interpret the results of its tool calls? If a tool returned a File Not Found error but the agent proceeded as if it had the data, the score is zeroed out.
- Adversarial Recovery: In the presence of a Synthetic Adversary (which we injected into the environment), did the agent realize the obstruction and pivot? Success here is defined as moving from “Plan A” to a distinct “Plan B” within two reasoning turns.
- Constraint Compliance: Did the agent stay within the predefined guardrails (e.g., token limits, tool-call depth)?
This scoring process is automated within our CI/CD pipeline. Every new version of our agentic logic is run against a “Gauntlet” of 500 adversarial trajectories. If the mean resilience score drops by more than 5%, the build is automatically rejected.
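The deterministic portion of the rubric can run before the LLM judge ever sees the trace. The toy scorer below covers the mechanical vectors (Parsimony, Fidelity, Compliance); the thresholds and field names are illustrative assumptions, not our production schema:

```python
def score_trajectory(steps, max_turns=8):
    """Toy rubric scorer. Each step is a dict with 'tool', 'args',
    'error' (str or None), and 'acknowledged_error' (bool)."""
    score = 100

    # Observation Fidelity: an ignored tool error zeroes the score outright.
    for s in steps:
        if s["error"] and not s["acknowledged_error"]:
            return 0

    # Logical Parsimony: deduct for every exact repeat of a tool call.
    seen = set()
    for s in steps:
        key = (s["tool"], s["args"])
        if key in seen:
            score -= 10
        seen.add(key)

    # Constraint Compliance: penalize exceeding the tool-call depth budget.
    if len(steps) > max_turns:
        score -= 25

    return max(score, 0)
```

Adversarial Recovery still requires the judge model, since deciding whether “Plan B” is genuinely distinct from “Plan A” is a semantic judgment; the numeric vectors above gate the build before that more expensive call.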
Case Study: The “Reasoning Loop” Failure
To see why this matters, let us look at a real-world failure case we observed in a financial reconciliation agent.
The agent was tasked with matching bank statements to internal invoices. It had two tools: list_bank_transactions and query_invoice_db.
On the surface, the agent was performing perfectly. It had a 98% success rate in matching records. But when we applied Trajectory Evaluation, we found a disturbing pattern. In about 20% of the cases, the agent was calling list_bank_transactions four times in a row without any changes to the input parameters.
It was “stuttering.” Because the terminal result (the match) was eventually correct, traditional testing never caught the waste. Each stutter was costing us $0.05 in unnecessary token fees and adding 4 seconds of latency. Across a million transactions, that is a $50,000 inefficiency that would have been invisible to a vibe-check.
Trajectory analysis identified the root cause: a minor prompt ambiguity that caused the agent to doubt its own memory of the tool output whenever the transaction list was longer than 50 items. We fixed the prompt, the stuttering vanished, and our latency dropped by 60%.
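Detecting this class of failure does not require a judge model at all; a simple scan for consecutive identical tool calls over the trace is enough. A minimal sketch (tool calls represented as (name, params) tuples for illustration):

```python
def find_stutters(calls, min_run=2):
    """Return (start_index, run_length, call) for every run of
    consecutive identical tool calls of length >= min_run."""
    runs, i = [], 0
    while i < len(calls):
        j = i
        while j + 1 < len(calls) and calls[j + 1] == calls[i]:
            j += 1
        if j - i + 1 >= min_run:
            runs.append((i, j - i + 1, calls[i]))
        i = j + 1
    return runs
```

Running this over nightly trace exports is how a 20% stutter rate surfaces as a number on a dashboard instead of an invisible line item on the token bill.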
The Safety Valve: Deterministic Fallbacks
One of the most powerful insights from Trajectory Evaluation is knowing exactly when to “downshift.” At a certain point in a resilience test, specifically when an agent has entered a second loop or failed to recover from an adversarial mock twice, we implement a Deterministic Fallback.
This is traditional, hard-coded logic that takes over the task. It may not be as “elegant” or “innovative” as the agentic reasoning, but it is reliable. By measuring the “Distance to Success” in the trajectory, we can set a threshold: if the agent is not within a specific logic-branch by turn four, the system automatically invokes a standard Python function to complete the transaction.
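The control flow of the downshift is straightforward. A minimal sketch, assuming a turn-four threshold and hypothetical `agent_step` / `deterministic_handler` callables:

```python
MAX_AGENT_TURNS = 4  # threshold derived from trajectory analysis (illustrative)

def run_with_fallback(agent_step, deterministic_handler, task):
    """Hybrid loop: agentic reasoning first, hard-coded logic if the
    agent has not converged by MAX_AGENT_TURNS."""
    state = {"task": task, "done": False}
    for _turn in range(MAX_AGENT_TURNS):
        state = agent_step(state)
        if state["done"]:
            return state["result"], "agent"
    # Safety valve: deterministic code completes the transaction.
    return deterministic_handler(task), "fallback"
```

Tagging each result with its origin ("agent" vs. "fallback") matters: the fallback rate itself becomes a trajectory metric you can alert on.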
This hybrid approach—Agentic for complexity, Deterministic for recovery—is the only way to maintain a five-nines service level agreement (SLA) in a production environment. You are not “giving up” on the AI; you are providing it with a safety net built from thirty years of proven software engineering.
A Final Note on Tool-Use Guardrails
As we build these trajectories, we are also building the guardrails. We’ve found that the best way to prevent trajectory drift is not to write a better prompt, but to constrain the tool-set dynamically.
Using Instructor v1.x or Pydantic v2.x to enforce rigid JSON schemas for tool inputs is step one. Step two is using the trajectory history to “mute” certain tools that are not relevant to the current state. If the agent is in a “Support” state, why does it still have access to the “Internal Database Deletion” tool? By pruning the tool-tree in real-time based on the current trace, we reduce the cognitive load on the model and significantly increase the parsimony of the final trajectory.
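Once tool inputs are schema-validated, the pruning itself is a lookup against a state whitelist. A minimal sketch (the registry, states, and tool names below are invented for illustration):

```python
# Hypothetical registry mapping each tool to the agent states
# in which it is allowed to appear.
TOOL_REGISTRY = {
    "get_user_history": {"states": {"support", "billing"}},
    "issue_refund": {"states": {"billing"}},
    "internal_db_delete": {"states": {"admin"}},
}

def prune_tools(current_state, registry=TOOL_REGISTRY):
    """Expose only the tools whitelisted for the agent's current state."""
    return sorted(
        name for name, meta in registry.items()
        if current_state in meta["states"]
    )
```

The pruned list is what gets serialized into the model's tool-calling context each turn, so a "Support"-state agent never even sees the deletion tool, let alone calls it.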
Conclusion: Engineering the “Thought”
We are at a transition point in the industry. The novelty of “the agent that can do things” is wearing off. The focus is shifting to “the agent that is reliable enough to run the business.”
You cannot build that reliability with a scaled-up version of manual QA. You build it by treating the “thought process” of your agent as code. You trace it, you stress-test it with adversaries, and you score its efficiency with deterministic science.
On GCP, we have the primitives—the TPUs, the GKE clusters, the Cloud Trace observability—to build these “Industrial-Grade Agents.” But the tools only matter if we have the discipline to use them.
The vibe-check is dead. Long live the trajectory.