Applied AI · 4 min read

LLMs are Terrible Backends (Unless You Force JSON)

Non-determinism is a bug, not a feature. We explore how to whip the model into compliance using Enforcers, Pydantic, and Constrained Generation.

The JSON Struggle

We have all been there. You spend hours tuning a prompt to get a clean JSON response, and 99 times out of 100, it works. Then, on the 100th time, the model decides to be “helpful”:

“Here is the JSON you requested:”, with the payload wrapped in a Markdown code fence.

Or worse, it adds a trailing comma that breaks your standard json.loads(). For a long time, we tried to fix this with regex and retry loops. But treating LLMs as text generators when we need data processors is a fundamental mismatch.
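For the record, this is roughly what that regex-and-retry duct tape looks like (a minimal sketch; call_llm stands in for whatever returns the raw completion string):

import json
import re

def parse_with_retries(call_llm, prompt: str, max_attempts: int = 3) -> dict:
    """The old way: scrape braces out of free text and hope json.loads() agrees."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)  # plain string back from the model
        match = re.search(r"\{.*\}", raw, re.DOTALL)  # grab the outermost-looking braces
        if not match:
            continue  # no braces at all -> retry
        try:
            return json.loads(match.group(0))  # still breaks on trailing commas, chatty prefixes, etc.
        except json.JSONDecodeError:
            continue  # invalid JSON -> burn another API call
    raise ValueError("Model never produced parseable JSON")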

There is a way to force the model to behave. It requires moving from “Prompt Engineering” to “Schema Engineering.”

The High-Level Solution: Instructor & Pydantic

If you are using Python, the instructor library is the gold standard. It leverages Pydantic to treat LLM outputs as strictly typed objects, not loose strings.

Instead of hoping for JSON, you define the exact structure you want as a class. Additional instructions (like “a clear, actionable task”) live in the type definition itself, as Field descriptions, rather than in the prompt.

End-to-End Example: Meeting Analyzer

Here is a complete, runnable example that extracts structured meeting minutes from raw text. Notice how we never mention “JSON” in the code.

import instructor
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Literal

# 1. Define your World as Types
# We want to extract a list of action items and a summary.
# The 'description' field acts as the prompt for that specific value.
class ActionItem(BaseModel):
    id: int
    description: str = Field(..., description="A clear, actionable task derived from the text")
    assignee: str = Field(..., description="Who is responsible? Use 'Unassigned' if unknown.")
    priority: Literal["High", "Medium", "Low"]

class MeetingAnalysis(BaseModel):
    topic: str
    attendees: List[str]
    action_items: List[ActionItem]
    summary: str = Field(..., description="A 2-sentence executive summary")

# 2. Patch the OpenAI Client
# This injects the logic to handle schema validation and automatic retries
client = instructor.from_openai(OpenAI())

# 3. The Function
def analyze_transcript(text: str) -> MeetingAnalysis:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=MeetingAnalysis, # <--- The Magic
        messages=[
            {"role": "system", "content": "You are a precise executive assistant."},
            {"role": "user", "content": f"Analyze this transcript: \n{text}"},
        ],
    )

# 4. Usage
transcript = """
Rajat: We need to ship the vector search feature by Friday.
Alice: I can handle the backend, but I need Bob to finish the API.
Bob: I'm swamped with the migration. I can't do it until next week.
Rajat: Okay, Bob, let's prioritize the migration. Alice, help Bob.
"""

result = analyze_transcript(transcript)

# 5. Result is ALREADY an Object
# No json.loads(), no string parsing.
print(f"Topic: {result.topic}")
for item in result.action_items:
    print(f"[Priority: {item.priority}] {item.assignee}: {item.description}")

Why this works:

  1. Validation Loop: If the LLM generates a string for id instead of an integer, Pydantic raises a validation error. Instructor automatically sends that error back to the LLM so it can correct itself (see the sketch after this list).
  2. Type Hints: The LLM sees the schema definition. It knows priority can only be “High”, “Medium”, or “Low”.
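The validation loop is also extensible. A minimal sketch, reusing the patched client from above: any custom Pydantic validator you add becomes part of the contract, and instructor's max_retries controls how many correction round-trips it will attempt.

from pydantic import BaseModel, Field, field_validator

class StrictActionItem(BaseModel):
    description: str = Field(..., description="A clear, actionable task derived from the text")
    assignee: str

    @field_validator("description")
    @classmethod
    def must_be_a_statement(cls, v: str) -> str:
        # Any ValueError raised here is fed back to the LLM as a correction prompt.
        if v.endswith("?"):
            raise ValueError("Action items must be statements, not questions")
        return v

item = client.chat.completions.create(
    model="gpt-4o",
    response_model=StrictActionItem,
    max_retries=3,  # how many validation round-trips instructor may attempt
    messages=[{"role": "user", "content": "Extract the action item: 'Could Bob maybe look at the API?'"}],
)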

The Low-Level Solution: Llama.cpp & Grammars

When running models locally (using llama.cpp or vLLM), we can go even deeper. We can control the Sampler itself.

A “Sampler” is the part of the engine that picks the next word. Usually, it picks based on probability. With Constrained Decoding, we force the probabilities of invalid tokens to zero.
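Here is a toy illustration of that masking step (a real engine works on token IDs against a compiled grammar state machine, not a Python dict, but the principle is the same):

import math

def constrained_step(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Mask every token the grammar forbids, then renormalize into probabilities."""
    masked = {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}
    total = sum(math.exp(s) for s in masked.values() if s != -math.inf)
    return {tok: (math.exp(s) / total if s != -math.inf else 0.0) for tok, s in masked.items()}

# The grammar says the next token must open a JSON object.
logits = {"Sure": 4.1, "Here": 3.7, "{": 2.9, "[": 2.2}
print(constrained_step(logits, allowed={"{"}))
# -> {'Sure': 0.0, 'Here': 0.0, '{': 1.0, '[': 0.0}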

Understanding GBNF

llama.cpp uses a format called GBNF (GGML BNF, an extension of Backus-Naur Form). It’s like a regex for the entire output.

If you tell llama.cpp that the output must be an array of numbers, and the model tries to generate the word “Sure”, the engine sees that “Sure” doesn’t match the grammar (roughly digits, commas, and brackets). It masks the logit of “S” to negative infinity, so its probability becomes zero. The model literally cannot produce anything outside the allowed grammar.

Example: Running with Grammar

You don’t usually write GBNF by hand. You convert a TypeScript interface or JSON Schema to GBNF.
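For example, if you already have the Pydantic models from earlier, the JSON Schema is one method call away (a sketch, assuming Pydantic v2); the resulting schema file then goes through a schema-to-GBNF converter, such as the json_schema_to_grammar.py helper that ships with llama.cpp.

import json
from typing import List
from pydantic import BaseModel

class Planet(BaseModel):
    name: str
    order_from_sun: int

class PlanetList(BaseModel):
    planets: List[Planet]

# Feed this file to your schema-to-GBNF converter of choice.
with open("planets.schema.json", "w") as f:
    json.dump(PlanetList.model_json_schema(), f, indent=2)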

  1. Generate Grammar: Save this as json.gbnf (simplified):

    root   ::= object
    value  ::= object | array | string | number | boolean | null
    string ::=  "\"" ( [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F]{4}) )* "\""
    ...
  2. Run Inference:

    ./llama-cli -m model.gguf -p "List the planets as JSON" --grammar-file json.gbnf

The output will be syntactically valid JSON, every time. The model has no choice. It saves compute, too, because generation stops as soon as the closing } completes the root rule.
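The same grammar works from Python too. A sketch, assuming the llama-cpp-python bindings and a local GGUF model (exact API names vary between versions):

from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="model.gguf")           # any local GGUF model
grammar = LlamaGrammar.from_file("json.gbnf")  # the same grammar file as above

out = llm(
    "List the planets as JSON",
    grammar=grammar,  # invalid tokens are masked at sampling time
    max_tokens=256,
)
print(out["choices"][0]["text"])  # parses cleanly with json.loads()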

Why This Matters

  1. Type Safety: Your code never crashes because of a missing key. The LLM cannot generate invalid JSON.
  2. Latency: You save tokens. The model doesn’t output “Here is the result…” It just outputs {.
  3. Security: Constrained output blunts prompt-injection attempts that try to break out of the JSON structure, though it is not a complete defense on its own.

If your LLM output isn’t compilable code, it’s just a hallucination waiting to break your prod build.
