· Agentic AI  · 13 min read

LLMs are Terrible Backends: Forcing Strict JSON Output

When you use LLMs as API endpoints, their probabilistic nature breaks downstream systems. Here is how to enforce strict JSON output through grammar-constrained generation and structured outputs.

Key Takeaways
  • LLMs were designed to be creative, not deterministic: Their probabilistic token generation makes them fundamentally unsuited for structured data output unless you constrain the generation space.
  • Prompt Engineering for JSON is brittle: Even sophisticated system prompts with "your output must be valid JSON" provide no guarantee. The model will occasionally skip a closing brace, use single quotes, or append a markdown code fence that breaks your parser.
  • Grammar-constrained generation is the only reliable approach: By constraining the token sampling to a formal grammar (JSON grammar, custom JSON Schema, or regex), you eliminate the possibility of structural errors at the cost of a modest per-token latency overhead.

I learned this lesson the hard way, which of course is the only way.

I was building an agent tool-calling system. The agent calls an API tool, generates the parameters, the system parses them, and executes the tool. Standard architecture. Everyone does this.

The tool schema defined a JSON object with three required fields. The model would output the JSON, the system would parse it, and the tool would execute. Simple.

In my testing, roughly 7 percent of the JSON outputs were structurally invalid. Not the kind of invalid that a parser could recover from. The kind of invalid that crashes the execution loop, drops the agent’s context, and forces a restart.

Seven percent. Roughly one in fourteen tool calls. The agent was silently failing far more often than any backend component is allowed to.

A seven percent error rate would get any backend engineer fired. In an LLM system, it is just Tuesday.

Why LLM Outputs Break

Here is what an LLM actually does when you ask it to output JSON.

It does not.

An LLM does not output data structures. It does not know or care about JSON. It outputs tokens, one at a time, where each token is chosen from a probability distribution. The system prompt says “output JSON.” The model generates the token {" because that was statistically likely given the prompt. Then it generates "field": " and continues.

If the training data contains millions of JSON documents (and it does: GitHub, Stack Overflow, API documentation, configuration files), the model has learned to generate JSON-like output. It is really good at it. It is good enough that most people ship it to production and do not think about it.

But it is not deterministic. The model does not have a JSON parser. It has never run json.loads() in its life. It is predicting the next likely token, and occasionally it predicts a token that produces structurally invalid output.

What does invalid output look like? I have seen every flavor:

  • A trailing comma at the end of the last array element, which is valid in a lot of programming languages but not in strict JSON
  • A missing closing brace, which makes the parser fail at an unexpected end of input
  • Single quotes instead of double quotes
  • Markdown code fence wrapping (```json ... ```) that your parser does not strip
  • Comments inside the JSON, which are not valid JSON according to the spec
  • The literal null instead of the string "null", which produces a JSON null value where a string was expected
  • A value that is undefined instead of null, or worse, just the word undefined floating in the middle of the object

Most of these are trivial for a human to spot and fix. None of them are trivial for an automated parser to handle without crashing.
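To make the list above concrete, here is a small sketch that feeds one hand-written sample per failure mode to Python's standard json module. The sample strings are hypothetical illustrations, not captured model output.

```python
import json

# Hypothetical samples, one per failure mode; none comes from a real model run.
BAD_OUTPUTS = {
    "trailing comma": '{"ids": [1, 2, 3,]}',
    "missing closing brace": '{"action": "search"',
    "single quotes": "{'action': 'search'}",
    "markdown fence": '```json\n{"action": "search"}\n```',
    "comment": '{"action": "search" // find the user\n}',
    "undefined value": '{"action": undefined}',
}


def parse_failures(samples):
    """Return {label: parser message} for every sample json.loads rejects."""
    failures = {}
    for label, text in samples.items():
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            failures[label] = exc.msg
    return failures


failures = parse_failures(BAD_OUTPUTS)
for label, message in failures.items():
    print(f"{label}: {message}")

# The null case is different: it parses, but the value is None, not "null".
null_case = json.loads('{"name": null}')
```

Every sample in the dictionary is rejected outright. The null case is the sneaky one, because the parser accepts it and the type error only surfaces later, in whatever code expected a string.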

The Prompt Engineering Trap

Your first instinct is to write a better system prompt. More words. More emphasis. More instructions about JSON validity.

You MUST output valid JSON. Your output must be parseable by json.loads().
Every open brace must have a matching close brace. Every string must use double quotes.
Do not include any text outside of the JSON object.

This helps. A lot. When you add these instructions, the invalid output rate drops from 7 percent to maybe 1.5 percent. Better, but not acceptable for production.
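Why is 1.5 percent still not acceptable? Because agents chain calls. A rough sketch of the arithmetic, assuming failures are independent and a task needs 10 tool calls (both are illustrative assumptions, not measurements):

```python
def clean_run_probability(per_call_failure_rate: float, tool_calls: int) -> float:
    """Chance that every tool call in a task parses, assuming independent failures."""
    return (1.0 - per_call_failure_rate) ** tool_calls


# 10 calls per task is an illustrative assumption, not a measured figure.
results = {}
for rate in (0.07, 0.015):
    results[rate] = clean_run_probability(rate, tool_calls=10)
    print(f"failure rate {rate:.1%}: {results[rate]:.1%} of 10-call tasks finish cleanly")
```

Under those assumptions, a 7 percent per-call failure rate means a little under half of all 10-call tasks finish without a parse error, and even at 1.5 percent roughly one task in seven still hits one.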

The problem is that prompt engineering is a probabilistic intervention applied to a probabilistic system. You are asking the model to be deterministic about something that is fundamentally stochastic. It works most of the time. When it does not, it fails in ways that are subtle and hard to detect.

I spent six months trying to prompt engineer my way out of JSON validity issues. Six months. I tried chain-of-thought prompting before the JSON output. I tried providing few-shot examples. I tried temperature zero. I tried structured output formatting. I tried asking the model to validate its own JSON before outputting it.

None of these approaches eliminated the error rate. They all reduced it, sometimes dramatically, but never to zero.

The fundamental problem is that asking an LLM to validate its own output does not give it the ability to parse and validate JSON. It just asks it to generate a validation statement. An LLM can say “My JSON output is valid” while generating JSON that is not.

Grammar-Constrained Generation

The correct solution is not to ask the model to be more careful. It is to change what tokens the model is allowed to generate.

Grammar-constrained generation, also called constrained decoding or structured output, works by restricting the token sampling process to tokens that are consistent with a formal grammar. You give the constrainer a grammar (JSON grammar, your custom schema, a regex pattern) and it modifies the model’s next-token probabilities to zero out any tokens that would violate the grammar.

The result is output that matches the grammar. Every time. Zero structural errors, as long as generation runs to completion instead of being cut off by a token limit.

How does this work in practice? Let me walk through it.

When the model generates the first token {, the grammar constrainer checks: is { a valid start of a JSON object according to the grammar? Yes. The token is allowed. The constrainer then updates its internal state to expect a JSON string key next (because JSON objects start with {"key": value}).

When the model generates the next token ", the constrainer checks: is this a valid continuation given that we are expecting the start of a JSON string? Yes. The constrainer enters “string parsing” mode and allows any character that is valid inside a double-quoted JSON string.

When the model generates "field", the constrainer checks: does this string produce a valid key? Yes. Then it checks: what comes after a valid JSON string key? A colon. The next token generation is constrained to only the : token. In a constrainer that disallows optional whitespace, the model cannot follow "field" with a space or a newline. It must output : immediately.

This constraint propagation continues through the entire output. At each step, the constrainer maintains the grammar state and eliminates any tokens that would produce a structurally invalid JSON path.

Put together, the token-by-token constraining for a simple JSON object works like this. The constrainer maintains a state machine derived from the JSON grammar. At every single token boundary, it works out which tokens are valid given the current state. If the model wants to produce a token that would violate the grammar, the constrainer sets its probability to zero before sampling happens. You are not checking whether the output is valid JSON after the fact. You are ensuring that it is impossible to produce invalid JSON in the first place.
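The walk-through above can be sketched as a toy. Everything here is a simplification I am assuming for illustration: a nine-token vocabulary, a hand-written state machine that only accepts flat objects with string values and no whitespace, and a fake "model" (the prefer helper) that likes single quotes. Real constrainers work on the model's actual tokenizer and a full grammar.

```python
import json
import math

VOCAB = ["{", "}", '"', "'", ":", ",", "name", "Ada", "\n"]


def step(state, ch):
    """Next state after reading one character, or None if it breaks the grammar."""
    if state == "start":
        return "key_or_end" if ch == "{" else None
    if state == "key_or_end":
        return "in_key" if ch == '"' else ("done" if ch == "}" else None)
    if state == "key_start":
        return "in_key" if ch == '"' else None
    if state == "in_key":
        if ch == '"':
            return "after_key"
        return "in_key" if ch.isalnum() or ch == "_" else None
    if state == "after_key":
        return "value_start" if ch == ":" else None
    if state == "value_start":
        return "in_value" if ch == '"' else None
    if state == "in_value":
        if ch == '"':
            return "after_value"
        return "in_value" if ch.isalnum() or ch in "_ " else None
    if state == "after_value":
        return "key_start" if ch == "," else ("done" if ch == "}" else None)
    return None  # "done": nothing may follow a closed object


def advance(state, token):
    """Feed a whole token through the machine; None means the token is illegal."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def greedy(logits_per_step, constrained):
    state, out = "start", []
    for logits in logits_per_step:
        if constrained:  # mask illegal tokens BEFORE choosing one
            logits = [l if advance(state, t) is not None else -math.inf
                      for t, l in zip(VOCAB, logits)]
        best = max(range(len(VOCAB)), key=logits.__getitem__)
        if logits[best] == -math.inf:
            break  # no legal token left
        out.append(VOCAB[best])
        if constrained:
            state = advance(state, VOCAB[best])
            if state == "done":
                break
    return "".join(out)


def prefer(*ranked):
    """Fake logits: tokens listed first score highest, everything else scores low."""
    return [float(len(ranked) - ranked.index(t)) if t in ranked else -5.0 for t in VOCAB]


# A fake model that reaches for single quotes whenever a quote is due.
steps = [prefer("{"), prefer("'", '"'), prefer("name"), prefer("'", '"'), prefer(":"),
         prefer("'", '"'), prefer("Ada"), prefer("'", '"'), prefer("}")]

unconstrained_text = greedy(steps, constrained=False)
constrained_text = greedy(steps, constrained=True)
print(unconstrained_text)
print(constrained_text)
```

The same logits produce an unparseable string when sampled freely and a valid object when the mask is applied, because the single-quote token never survives to the selection step.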

The overhead is small. On modern hardware, the grammar constrainer adds roughly 3 to 5 milliseconds of latency per token. For a typical tool call that generates 50 tokens, that is 150 to 250 milliseconds of overhead. Almost nothing in the grand scheme of a model inference call.

Implementation Approaches

There are three main approaches to grammar-constrained generation, each with trade-offs:

Regex-Guided Sampling

This is the simplest approach. You provide a regex pattern that describes the valid output format, and the constrainer eliminates tokens that would violate the pattern. For JSON output, you provide a regex that matches valid JSON strings.

The implementation is straightforward. You maintain a regex NFA (nondeterministic finite automaton) state as each token is generated. Before sampling each token, you check which tokens would advance the NFA into a valid state. You zero out the probabilities of all tokens that would advance the NFA into an invalid state.

Regex guidance works well for simple output formats. It is fast and has minimal overhead. But it has a fundamental limitation: regex cannot express the full JSON grammar. JSON has nested structures (objects within objects, arrays within arrays) that require a context-free grammar, not a regular grammar. Regex-guided sampling can handle flat objects, or nesting to a fixed depth that you spell out in the pattern, but it cannot express arbitrarily nested structures.
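To see that limitation in a pattern, here is a sketch of the kind of regex you might hand to a regex-guided sampler for a flat object. The pattern names are my own, and it deliberately ignores string escapes and floats. It uses whole-string matching only to show what the pattern accepts; an actual constrainer would walk the automaton token by token.

```python
import re

STRING = r'"[^"\\]*"'              # no escape sequences, to keep the sketch short
INTEGER = r'-?(?:0|[1-9][0-9]*)'
VALUE = rf'(?:{STRING}|{INTEGER}|true|false|null)'
PAIR = rf'{STRING}\s*:\s*{VALUE}'

# A flat object: zero or more key/value pairs, values are scalars only.
FLAT_OBJECT = re.compile(rf'\{{\s*(?:{PAIR}(?:\s*,\s*{PAIR})*)?\s*\}}')

flat = '{"action": "search", "limit": 10}'
nested = '{"action": "search", "params": {"q": "x"}}'

flat_ok = FLAT_OBJECT.fullmatch(flat) is not None
nested_ok = FLAT_OBJECT.fullmatch(nested) is not None
print(flat_ok, nested_ok)
```

The flat object matches and the nested one does not. You could extend VALUE with one more level of object by hand, but each extra level has to be written out, which is exactly what "regular, not context-free" means in practice.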

JSON Schema Constrained Decoding

This approach uses a JSON Schema (the same format most tool-call APIs use to describe parameters) as the constraining grammar. The constrainer maintains a tree of valid JSON types based on the schema, and at each step, it narrows down which tokens are valid based on the current position in the schema tree.

This is more powerful than regex guidance because it understands nested structures. It knows that inside an object, you need a string key, then a colon, then a value of the type specified by the schema. It knows that inside an array, you can have any value of that array’s type.

The implementation is more complex, and the computational overhead is higher. The schema needs to be compiled into a trie or DFA structure, and the constrainer needs to traverse that structure as each token is generated.

However, this is the approach that most production systems use. It handles the full JSON grammar, including nested objects and arrays, arbitrary depth, and all JSON data types.
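One simple way to picture the compilation step is to turn a schema into a pattern that an automaton-based constrainer can follow. The sketch below is my own simplification, not any library's implementation: it supports only string, integer, boolean, enum, and object, treats every property as required, fixes the key order, and allows no whitespace.

```python
import json
import re


def schema_to_regex(schema: dict) -> str:
    """Compile a small JSON Schema subset into a regex (illustrative only)."""
    if "enum" in schema:
        options = (re.escape(json.dumps(v)) for v in schema["enum"])
        return "(?:" + "|".join(options) + ")"
    kind = schema["type"]
    if kind == "string":
        return r'"[^"\\]*"'
    if kind == "integer":
        return r"-?(?:0|[1-9][0-9]*)"
    if kind == "boolean":
        return r"(?:true|false)"
    if kind == "object":
        # Each property becomes "name":<pattern for its value>, in schema order.
        parts = [re.escape(json.dumps(name)) + ":" + schema_to_regex(sub)
                 for name, sub in schema.get("properties", {}).items()]
        return r"\{" + ",".join(parts) + r"\}"
    raise ValueError(f"unsupported schema: {schema!r}")


tool_schema = {
    "type": "object",
    "properties": {
        "action": {"enum": ["search", "read", "write"]},
        "target": {
            "type": "object",
            "properties": {"table": {"type": "string"}, "limit": {"type": "integer"}},
        },
    },
}

pattern = re.compile(schema_to_regex(tool_schema))
good = json.dumps({"action": "search", "target": {"table": "users", "limit": 5}},
                  separators=(",", ":"))
bad = json.dumps({"action": "delete", "target": {"table": "users", "limit": 5}},
                 separators=(",", ":"))
good_ok = pattern.fullmatch(good) is not None
bad_ok = pattern.fullmatch(bad) is not None
print(good_ok, bad_ok)
```

Nesting works here because the schema itself has finite depth, so the recursion bottoms out. A recursive schema (a tree node that contains tree nodes) is where a pattern like this stops being enough and a real grammar is needed.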

Grammar-Aware Decoding Libraries

There are now several open-source libraries that implement grammar-aware decoding:

  • Outlines (by .txt, also written dottxt) is one of the most mature and widely used. It supports JSON schema constrained generation, regex-guided generation, and context-free grammars written in Lark syntax. It integrates with HuggingFace transformers, vLLM, and other inference backends.

  • LMFormatEnforcer provides JSON Schema and regex-constrained generation with a focus on low overhead. It needs access to the logits, so it targets local inference stacks such as HuggingFace transformers, vLLM, and llama.cpp.

  • guidance (by Microsoft) is a higher-level framework that lets you specify output structure using a combination of Python code and grammar definitions. It translates the Python constraints into the equivalent grammar constraints.

These libraries differ in their performance characteristics, grammar coverage, and integration requirements, but they all solve the same fundamental problem: generating deterministic, structurally valid output from a probabilistic model.

The Performance Trade-Off

There is a real performance cost to grammar-constrained generation, and it is worth understanding before you commit to it.

The constrainer runs on every new token. After the model outputs its logits for the next token, the constrainer sets the logits of invalid tokens to negative infinity, so their probabilities are exactly zero after the softmax. The remaining probabilities are renormalized and sampled. This adds compute to every token-generation step.
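The masking and renormalization step is small enough to write out. This is a pure-Python sketch with made-up logits; a real inference server does the same thing on a tensor the size of the vocabulary.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over logits after setting disallowed entries to -inf."""
    if not any(allowed):
        raise ValueError("grammar allows no token at this step")
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    peak = max(masked)                              # subtract the max for stability
    exps = [math.exp(l - peak) for l in masked]     # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three tokens; the model's favorite (index 0) is illegal here.
logits = [2.0, 1.0, 0.5]
allowed = [False, True, True]
probs = masked_softmax(logits, allowed)
print(probs)
```

The illegal token ends up with probability zero no matter how much the model wanted it, and the two legal tokens keep their relative ordering.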

In my benchmarks:

JSON Schema Constrained Decoding roughly doubles the per-token latency compared to unconstrained generation. If unconstrained generation produces a token every 5 milliseconds, constrained generation takes about 10 milliseconds per token. For a 100-token output, the difference is roughly 500 milliseconds.

Regex-Guided Sampling adds significantly less overhead. The regex NFA transitions are computationally cheaper than traversing a schema-derived DFA. The per-token latency is roughly 1.2 to 1.3x the unconstrained figure, which at the same 5 millisecond baseline adds about 1 to 1.5 milliseconds to each token’s generation time. For a 100-token output, the difference is roughly 100 to 150 milliseconds.

These numbers are for single-request generation. In a continuous batching inference server, each sequence still needs its own mask at every step, but that work can usually be overlapped with the batch’s forward pass, so the throughput impact on a server running multiple concurrent requests is typically much smaller than the single-request numbers suggest.

The question is always: is 2x per-token latency worth zero validation errors? For most production systems, the answer is yes. A validation callback that fails once every 14 requests is worse than 500 milliseconds of extra latency on a single request.
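The latency figures above are simple enough to recompute for your own workload. A sketch using the multipliers quoted in this section, and assuming the same 5 millisecond baseline for both approaches:

```python
def added_latency_ms(baseline_ms_per_token: float, slowdown: float, tokens: int) -> float:
    """Extra wall-clock time from constraining, given a per-token slowdown factor."""
    return baseline_ms_per_token * (slowdown - 1.0) * tokens


# Multipliers come from the benchmarks described above; the shared baseline is assumed.
schema_extra = added_latency_ms(5.0, 2.0, tokens=100)
regex_extra_low = added_latency_ms(5.0, 1.2, tokens=100)
regex_extra_high = added_latency_ms(5.0, 1.3, tokens=100)

print(f"JSON Schema: +{schema_extra:.0f} ms")
print(f"Regex: +{regex_extra_low:.0f} to +{regex_extra_high:.0f} ms")
```

Swap in your own baseline and output length before deciding. A 30-token tool call on a fast model pays far less than these 100-token examples.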

Integration Patterns

The easiest integration point is at the API request level. Most LLM providers now support a parameter that activates constrained generation.

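With OpenAI’s API, you use the response_format parameter with a JSON schema. The name value below is a placeholder. Strict mode also expects every property to be listed in required and additionalProperties to be false:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "tool_call",
      "strict": true,
      "schema": {
        "type": "object",
        "required": ["field_a", "field_b"],
        "additionalProperties": false,
        "properties": {
          "field_a": {"type": "string"},
          "field_b": {"type": "integer"}
        }
      }
    }
  }
}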

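This tells the OpenAI backend to activate grammar-constrained decoding using the provided JSON schema. A completed response is guaranteed to match the schema; a refusal, or a response cut off by the token limit, still has to be handled by the caller. The server handles all constraining internally.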
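Strict mode is picky about the schema itself. As far as I know, it expects every object to list all of its properties in required and to set additionalProperties to false. The helper below is a hypothetical convenience of my own (it is not part of any SDK) that rewrites a schema to meet those two rules before you send it.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy where every object requires all its properties and forbids extras."""
    strict = copy.deepcopy(schema)
    _tighten(strict)
    return strict


def _tighten(node):
    if not isinstance(node, dict):
        return
    if node.get("type") == "object":
        props = node.setdefault("properties", {})
        node["required"] = list(props)
        node["additionalProperties"] = False
        for sub in props.values():
            _tighten(sub)
    elif node.get("type") == "array":
        _tighten(node.get("items"))


loose = {
    "type": "object",
    "properties": {
        "field_a": {"type": "string"},
        "items": {
            "type": "array",
            "items": {"type": "object", "properties": {"id": {"type": "integer"}}},
        },
    },
    "required": ["field_a"],
}

# "tool_call" is a placeholder name for the schema.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "tool_call", "strict": True, "schema": make_strict(loose)},
}
```

Note the side effect: fields that were optional become required. If a field is truly optional, the usual workaround is to let its type include null rather than leaving it out of required.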

For self-hosted inference (vLLM, TGI, TensorRT-LLM), you typically integrate a library like Outlines or LMFormatEnforcer as a logits processor that masks the logits before sampling.

Here is a minimal example with vLLM and Outlines:

import json

from outlines import generate, models
from vllm import SamplingParams

# Assumes the pre-1.0 Outlines API (outlines.models / outlines.generate).
# Later releases changed these entry points, so check your installed version.
model = models.vllm("meta-llama/Llama-3.1-8B-Instruct")

schema = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["search", "read", "write"]},
        "resource": {"type": "string"},
        "params": {"type": "object"}
    },
    "required": ["action", "resource"]
}

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# generate.json expects the schema as a JSON string (or a Pydantic model),
# and builds a generator whose sampling is masked by that schema
generator = generate.json(model, json.dumps(schema))

result = generator(
    "You are a system that outputs action plans. Generate an action plan for searching the user database for the email '[email protected]'.",
    sampling_params=sampling_params,
)
# result is a Python dict parsed from schema-conforming JSON
# no JSON parsing needed, no validation callback needed

The key thing to notice here: the output is a parsed Python object. Not a JSON string that you have to parse. The constrained generation + parsing pipeline gives you both structural correctness and type safety in a single step.

What This Means for Agent Architecture

If you are building an agent system that calls tools or APIs, you need strict JSON output. Every tool call is a JSON payload. Every structured response from a downstream system is JSON. Every state update the agent needs to make is a JSON object.

The seven percent error rate I saw in my first system was not a minor edge case. It was structural. The agent was silently failing at the basic mechanism of tool communication. Every tool call had roughly a one-in-fourteen chance of producing garbage that would crash the parser.

Grammar-constrained generation eliminates that error rate. It turns the LLM from a probabilistic output generator into a deterministic data producer for your structured outputs.

The overhead is real but bounded. The implementation requires a library and some integration effort. There is no way around either of those things, and that is fine. This is the cost of treating an LLM as a backend component rather than a chat interface.

LLMs are terrible backends. They are brilliant natural language processors, excellent pattern recognizers, surprisingly creative reasoners. But they are terrible at producing deterministic, structured data.

Forcing them to produce strict JSON through grammar-constrained decoding is not a hack. It is the correct engineering approach for a probabilistic system that needs to interface with deterministic infrastructure.

You would never build a web application that sends unvalidated input from a user form directly to a database. You would not parse XML from an external API without a schema validator. The same principle applies here.
