
The structured output schema that doubled our token bill


Forcing LLMs into strict JSON schemas feels safe, but the hidden token cost of schema repetition and refusal loops will quietly destroy your unit economics. Here is how we measured and fixed it.

We moved our generative pipeline to structured outputs to get clean JSON for downstream automation. It worked, until we audited the token usage and found our bill had doubled. The model was spending half its output budget repeating key names and escaping syntax instead of generating actual value.

The schema tax in production

When you demand a strict schema with 30 fields, the model must emit every key and bracket on every call. We measured an average of 140 output tokens per response dedicated purely to JSON structure. At scale, you pay for structure that your application code could easily inject.
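To get a rough feel for how much of a response is structure rather than data, the sketch below compares the characters spent on leaf values with the full serialized length. It is a character-based proxy only: real billing is per token and depends on your model's tokenizer, and the function name `structure_share` and the sample payload are made up for illustration.

```python
import json


def structure_share(payload: dict) -> float:
    """Fraction of serialized characters that are keys, quotes, and punctuation.

    Character counts are an assumption standing in for token counts.
    """
    serialized = json.dumps(payload, separators=(",", ":"))
    value_chars = 0

    def walk(node) -> None:
        nonlocal value_chars
        if isinstance(node, dict):
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)
        elif isinstance(node, str):
            # Count the raw text only, not the surrounding quotes.
            value_chars += len(node)
        else:
            # Numbers, booleans, and null count at their serialized length.
            value_chars += len(json.dumps(node))

    walk(payload)
    return 1 - value_chars / len(serialized)


# Hypothetical three-field payload; a 30-field schema with short values skews higher.
example = {"title": "Blue mug", "price": 12, "tags": ["kitchen", "gift"]}
print(f"{structure_share(example):.0%} of characters are structure")
```

Running this over a sample of logged responses gives a quick first estimate before you invest in tokenizer-accurate measurement.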

Worse, strict schemas can increase refusal rates. When the model lacks data for a required field, it often halts or pads the response with nulls and empty arrays. Our error logs showed a 12% spike in validation failures compared to our previous free-text runs. The model was tying itself in knots to fit the mould.

The two-step extraction refactor

We decoupled generation from structuring. First, the LLM produces a concise free-text block containing only the variable data. This reduced the average output from 310 tokens to 90 tokens. The model writes naturally, which yields higher-quality extractions and less hallucinated filler.

Second, a deterministic Python script parses that text into the required JSON schema. If a field is missing, the script assigns a null value. If the format breaks, we retry the cheap text generation. This shift moved our failure handling from expensive schema-constrained retries to cheap text regenerations and trivial code fixes.
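A minimal sketch of that second step might look like the following. It assumes the model was asked to answer in `field: value` lines. The field names in `SCHEMA_FIELDS`, the line format, and the `generate` callable are all hypothetical stand-ins, since the article does not show its real schema or client code.

```python
import re
from typing import Callable

# Hypothetical schema fields; substitute the ones your database expects.
SCHEMA_FIELDS = ("title", "summary", "category")

# Matches lines such as "Title: Blue mug" (an assumed output convention).
LINE_PATTERN = re.compile(r"^\s*([A-Za-z_]+)\s*:\s*(.*?)\s*$")


def parse_free_text(text: str) -> dict:
    """Map 'field: value' lines onto the schema; missing fields become None."""
    record = {field: None for field in SCHEMA_FIELDS}
    matched = 0
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue
        key, value = match.group(1).lower(), match.group(2)
        if key in record:
            record[key] = value or None  # an empty value is treated as missing
            matched += 1
    if matched == 0:
        # Nothing usable at all: treat this as a broken format.
        raise ValueError("no recognizable 'field: value' lines")
    return record


def extract(prompt: str, generate: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Retry the cheap free-text generation only when the format breaks."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_free_text(generate(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Usage with a stubbed model call that fails once, then succeeds.
replies = iter(["sorry, unclear", "Title: Blue mug\nCategory: kitchen"])
record = extract("Describe the item.", lambda _prompt: next(replies))
print(record)
```

A missing field never triggers a retry here; only a response with no parseable lines does, which keeps regeneration for the cases code cannot repair.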

When structured output is worth the cost

There are genuine cases where you must pay the schema tax. If you need the model to perform complex nested classification, forcing a specific enum can constrain the output space and improve accuracy. We kept strict schemas for our routing agent, where a 40-token overhead was negligible.
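For a routing-style task, the schema can be as small as a single enum field. The sketch below shows one such schema in JSON Schema style, plus a plain-Python check of the reply. The route labels and the `parse_route` helper are invented for illustration and are not the routing agent described above.

```python
import json

# Hypothetical route labels; a real routing agent would define its own.
ROUTES = ("billing", "technical_support", "sales")

# JSON Schema style definition you might pass to a structured-output endpoint.
ROUTING_SCHEMA = {
    "type": "object",
    "properties": {"route": {"type": "string", "enum": list(ROUTES)}},
    "required": ["route"],
    "additionalProperties": False,
}


def parse_route(raw: str) -> str:
    """Check a model reply against the enum without a schema library."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict) or set(data) != {"route"}:
        raise ValueError("expected an object with exactly one key: 'route'")
    route = data["route"]
    if route not in ROUTES:
        raise ValueError(f"unknown route: {route!r}")
    return route


print(parse_route('{"route": "sales"}'))
```

With one short key and one short value, the structural overhead stays small relative to the benefit of a closed output space.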

The trade-off breaks down at volume. For our content generation pipeline processing 50,000 items daily, the schema tax added roughly 7 million surplus tokens per day. At our pricing tier, that was an extra 420 euros daily for the privilege of the model doing basic serialization work.
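The arithmetic behind those figures is short enough to check directly. The sketch below uses only numbers quoted in this article. The per-million price is back-calculated from them, so treat it as an implied rate rather than a quoted one.

```python
ITEMS_PER_DAY = 50_000
STRUCTURE_TOKENS_PER_RESPONSE = 140  # average structural overhead measured above
EXTRA_COST_EUR_PER_DAY = 420

surplus_tokens_per_day = ITEMS_PER_DAY * STRUCTURE_TOKENS_PER_RESPONSE
print(surplus_tokens_per_day)  # 7000000

# Implied output price, derived from the two figures above (not a quoted rate).
implied_eur_per_million_tokens = EXTRA_COST_EUR_PER_DAY / (surplus_tokens_per_day / 1_000_000)
print(implied_eur_per_million_tokens)  # 60.0
```

Swapping in your own volume, overhead, and price per million output tokens gives the daily cost of your schema.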

Measuring your own schema overhead

Run a simple A/B test on 1,000 production prompts. Send half to your strict schema endpoint, and half to a free-text endpoint with identical instructions. Compare the total output tokens, the latency, and the error rates. You will quickly see where the schema is helping versus where it is just expensive formatting.
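Once both arms have run, the comparison reduces to a per-arm summary. The sketch below assumes you have already logged one record per call. The `CallResult` fields and the four sample rows are placeholders invented to show the shape of the output, not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallResult:
    arm: str            # "schema" or "free_text"
    output_tokens: int  # as reported by your provider's usage data
    latency_ms: float
    failed: bool        # validation or parsing failure


def summarize(results: list[CallResult]) -> dict[str, dict[str, float]]:
    """Average output tokens, latency, and error rate for each arm."""
    summary = {}
    for arm in sorted({r.arm for r in results}):
        rows = [r for r in results if r.arm == arm]
        summary[arm] = {
            "calls": len(rows),
            "avg_output_tokens": mean(r.output_tokens for r in rows),
            "avg_latency_ms": mean(r.latency_ms for r in rows),
            "error_rate": sum(r.failed for r in rows) / len(rows),
        }
    return summary


# Made-up placeholder rows; replace them with your logged calls.
results = [
    CallResult("schema", 310, 900.0, False),
    CallResult("schema", 330, 1100.0, True),
    CallResult("free_text", 90, 400.0, False),
    CallResult("free_text", 110, 600.0, False),
]
summary = summarize(results)
for arm, stats in summary.items():
    print(arm, stats)
```

Assign prompts to arms at random, or alternate them by index, so that both halves see a similar mix of inputs.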

Pragmatic takeaway: stop asking the LLM to be a JSON serializer. Generate free-text or minimal markdown, then use code to wrap it into the schema your database expects. Reserve strict structured outputs for classification tasks where the schema constraints genuinely improve reasoning, not just data formatting.

Your unit economics will improve, your latency will drop, and your failure rates will shrink because the model is no longer fighting your syntax when it should be focusing on the content.
