Search

· Strategy  · 12 min read

The True Cost of Agentic Workflows

Hidden compute and API costs accumulate fast when deploying autonomous agent loops in production. A candid look at the real economics of agentic workloads.

Featured image for: The True Cost of Agentic Workflows
Key Takeaways
  • Agents multiply token consumption exponentially: A single task that takes 3 API calls in traditional automation can require 20 to 80 calls in an agentic workflow, once you add reasoning, self-correction, tool discovery, and retry loops.
  • The hidden cost is not the model call, it is the control loop: Most cost models budget for the LLM output. They forget the orchestrator, the intermediate tool outputs, the embedding cache lookups, and the supervisor agent that re-evaluates each step.
  • You cannot run agents at pilot scale and expect production economics: What looks cheap in a demo disappears when you scale to ten thousand concurrent users, because every agent turn burns tokens, every retry burns more, and most of those burnings produce nothing of value for the customer.

I spent the better part of this year helping three different companies figure out why their agentic workflows are bleeding money. Each one started the same way. The leadership team had seen a compelling demo, maybe at a conference or a vendor pitch, and they approved a production budget. The prototype worked beautifully in the sandbox. A single user, a single task, running on a small batch of data with generous rate limits.

Then they turned it loose.

Within a week, their API costs had gone up by a factor of forty. Within a month, they had to pull the plug.

This is not a story about agents being bad technology. Agentic systems are real and useful. The problem is simpler and more boring. Nobody does the math on the control loop.

What Nobody Budgets For

When you design a traditional automation, you know the cost up front. You call an API. It returns a result. You call another API. Done. Three calls. You can price that to the cent.

An agentic workflow is different. The model does not just produce the answer. It has to figure out what to do, call a tool, read the result, decide what to do next, call another tool, read that result, and so on. Each of those steps burns tokens. Not just the final output tokens. The input tokens too. You have to feed the model the full conversation history, the tool results, the system prompt, the instruction, the constraints.

Let me walk you through a concrete example. A customer service agent that handles a refund request.

The human says: “I returned this item three weeks ago and have not received my refund.”

In a traditional system, you have a rule. If the item was returned and the window is within 30 days, approve. You hit the payment API, you hit the inventory API, you send a confirmation. Two API calls. Maybe $0.03 in model cost for the intent classification step.

In an agentic system, the model reads the message. It reasons about what might be going on. It searches a knowledge base for the company’s return policy. It calls an inventory lookup tool. The tool returns a result, which gets fed back into the conversation. The model reasons again, realizes the customer used a different email address for the return, and calls the database with the corrected email. That database call goes through. The model now has enough information to decide. It calls the payment API tool. The API returns success. The model writes a response to send back to the customer.

Let me count. That is roughly 12 LLM calls. Minimum. Before you factor in the self-correction step where the model checks its own work, or the tool that rates the quality of the database lookup, or the supervisor agent that verifies the refund amount matches policy.

Each LLM call is not just the final answer tokens. It is the prompt tokens, which include the growing conversation history, the system instructions, the tool definitions, the schema descriptions. Those prompt tokens are where the real cost lives.

The math is simple. Twelve calls with an average of 3,000 prompt tokens and 400 output tokens per call, on a model that charges 3permillionprompttokensand3 per million prompt tokens and15 per million output tokens. That is 0.21percustomerinteraction.Inthetraditionalsystem,itwas0.21 per customer interaction. In the traditional system, it was0.03.

Seven times more expensive to solve the same problem.

Now multiply that by one thousand customers per day. You are burning $6,300 a day on model calls. Thirty thousand a month. You will start hearing from the CFO.

The Cost Architecture, Visualized

The cost difference between linear automation and agentic inference comes down to call multiplicity. Here is what a real interaction looks like on both sides:

Each extra reasoning step, each tool call, each embedding lookup, each self-correction pass multiplies your token burn. The linear path is predictable and cheap. The agentic path is powerful and expensive. The math is the math.

The Compounding Problem of Retry Loops

Here is the part that makes agentic economics really ugly.

LLMs make mistakes. All of them. Not catastrophic mistakes, usually. Just small ones. The model calls the inventory API with the wrong product ID. The database returns empty because of a case-sensitive field the model missed. The model calls the tool again with slightly different parameters.

Your agent detects the empty response. It reasons about why it failed. It tries again. This is what you want, right? A smart system that handles errors gracefully. You built it in.

What you did not build in is a cap on how many times the agent will retry.

I saw one production system where a customer support agent got stuck in a loop with a partially-documented inventory API. The API returned a non-standard error code that the agent had never seen. Instead of escalating to a human, the agent kept retrying with different interpretations of the error. Forty-two retry calls. Forty-two API burns. On a single customer ticket.

That one ticket cost 1.80inmodelcalls.Theaverageticketcost1.80 in model calls. The average ticket cost0.21. This one ticket was eight times the average, and it happened three times a day.

The solution is not cleverness. It is discipline. Every agentic system needs a strict maximum turn budget. Set it at design time, based on the worst acceptable cost per user interaction. If the agent cannot solve the task within that budget, it escalates. No exceptions.

But here is the thing about budgeting. You are now running a financial system around your AI system. You need counters, you need alerting, you need dashboards showing token consumption by agent, by task, by failure rate. You are not just an AI team anymore. You are a FinOps team.

The Orchestrator Tax

There is another cost that almost nobody tracks. The orchestrator.

An agent does not call the LLM. An orchestrator calls the LLM. And the orchestrator has to do work too. It has to manage the conversation state. It has to serialize and deserialize tool results. It has to track which tools have been called and which have not. It has to handle timeouts, retries, rate limit backoff. It has to maintain the tool registry. It has to log everything for audit purposes.

This sounds like infrastructure overhead. It is not. It has real cost implications.

The orchestrator adds latency to every call. That extra latency means each worker node handles fewer requests per second. That means you need more worker nodes. That means more infrastructure cost.

Worse, the orchestrator has to maintain conversation state for every active agent session. In a traditional web service, session state is simple. A user ID. A token. Maybe a small cache entry. In an agentic system, the orchestrator maintains the full conversation history for every agent that is currently thinking. A hundred concurrent agents, each with 20 tool calls in their history, each averaging 4,000 tokens. That is 8 million tokens living in memory, just waiting for the next instruction.

For context, that is roughly the equivalent of storing 8,000 pages of text. In RAM. For as long as the agents are running.

At any meaningful scale, you cannot just throw memory at this problem. You need stateful orchestration. You need checkpointing. You need the ability to persist and restore agent state without losing the conversation. And all of that adds infrastructure layers that a pilot project never planned for.

Embedding Costs Nobody Talks About

Agentic systems don’t just call LLMs. They embed things.

When an agent needs to check a knowledge base, it typically creates an embedding of the user query, searches a vector index, and feeds the top results back into the prompt. Each embedding is a smaller model call. The embeddings themselves cost money. And they are not free.

An embedding model that runs at $0.02 per thousand tokens might seem negligible. But your agent embeds every tool result, every policy document it loads, every user message context snippet. The embedding calls stack up. And they accumulate linearly with every interaction.

I worked with one team that had a completely separate embedding cost budget that was growing 30 percent month over month. They had not planned for it because their architecture diagram only showed the primary LLM calling tokens. The embedding layer was invisible until the invoice arrived.

The Real Comparison: Pilot vs Production

The disconnect between pilot and production economics comes from a simple asymmetry. Pilots are designed to look good. Producers have to look at the invoice.

In a pilot, you have:

  • One user running the workflow once
  • A small, curated dataset
  • No concurrent sessions
  • No retry storms
  • A team that manually fixes edge cases

In production, you have:

  • Thousands of concurrent users
  • Dirty, unpredictable data
  • Concurrent agent sessions competing for rate limits
  • Failure patterns you never anticipated
  • A customer service team that escalates when the agent fails

The cost ratio between these two worlds is massive. And nobody budgets for the delta because nobody expects the delta.

When I advise boards and executives on agentic investments, I ask one question: what is the token cost per resolved work item at scale? Not the cost of the demo. The cost when fifty agents are running simultaneously on messy data, and ten percent of them have to retry, and fifteen percent will never succeed without human escalation.

Most leadership teams can’t answer that question. They have a budget for an LLM subscription and a hope that things will work out.

Building Cost Control Into the Architecture

The companies that make this work are not the ones with the most sophisticated agents. They are the ones that build cost control into the architecture from day one.

The first move is task decomposition. Don’t let a single agent handle a multi-step workflow. Break it down. Each step has a bounded cost. If step three fails, you only burned cost on steps one and two, not on a runaway full workflow.

The second move is caching at every level. If two different users ask the agent the same question about the return policy, the response should come from cache, not from fresh model calls. I’ve seen teams save 40 percent of their token budget by implementing semantic caching for repeated queries.

The third move is model stratification. Not every step needs GPT-4. The intent classification? A smaller model. The tool selection? Another small model. Only the final reasoning step needs the biggest model. Most agentic systems use one model for everything. This is wildly inefficient.

The fourth move is hard budgets with graceful degradation. Every agent has a turn budget. Every workflow has a cost budget. When the budget is hit, the system falls back to a cheaper path or escalates to a human. This is not a failure mode. It is the design.

The fifth move is observability as a first-class concern. You are running a probabilistic system at scale. You need to see every call, every token count, every latency, every failure. Without that visibility, you are flying blind and your invoice is the only instrument you have.

The Bottom Line

Agentic workflows are not free. They are not even close to free at scale.

The fundamental insight you need to internalize is this: each additional reasoning step an agent takes multiplies your token consumption. Each additional tool call multiplies your prompt size. Each retry multiplies your cost by a factor you may not have planned for.

This does not mean you should not build agentic systems. It means you should build them with economics as a first-class constraint, not as an afterthought. Budget for the control loop, not just the model. Budget for retries, not just the happy path. Budget for orchestration overhead, not just model calls.

If you do the math correctly upfront, most agentic workflows do work out. The savings in operational efficiency are real. The ability to handle complex workflows that would require multiple human operators is real. But the math has to be right, and it has to be right before you deploy.

The companies that get this wrong pay a tax. Sometimes a big one. The question is not whether agentic systems are worth it. The question is whether you have done the accounting.

Back to Blog

Related Posts

View All Posts »