Essay · Feb 2026 · 9 min read

Why your LLM prompts are leaving 35% on the floor.

A short, opinionated case for treating prompts the way you treat any other piece of production code.

Most production prompts I see in the wild were written once, by one engineer, in about ninety minutes, on a Tuesday. They’ve been in production for fourteen months. Nobody has touched them because nothing is obviously wrong. The model returns plausible answers. The eval suite (if there is one) is green. Tickets close.

This is the steady-state of most LLM features in 2026, and it is leaving an enormous amount of quality on the floor.

The premise

The capability of an LLM-backed feature is bounded by two things: the model and the prompt. The industry has spent two years arguing about the model. The prompt has been treated as a configuration value — a string in a YAML file you change when somebody complains.

That’s the wrong abstraction. A prompt is a program. It has structure, it has dead branches, it has performance characteristics. It deserves the same scrutiny you’d give a function that runs ten thousand times an hour in production. Because it is one.

If a prompt is the function, an evaluation set is the test suite. You wouldn’t ship the function without the tests.

The four moves

Here’s the loop I run, in order, every time I’m asked to improve an LLM feature:

1. Build a real eval set before touching the prompt

Twenty to two hundred examples. Inputs you actually see. Golden outputs scored by a real human (or, fine, an LLM-as-judge with manual spot-checks). The temptation is to skip this step and just “try a few changes.” Don’t. Without a baseline, you can’t tell improvement from noise. You’ll spend a week tuning, ship it, and discover at the next outage that you’re no better off than where you started.
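One minimal way to store such a set is JSONL checked into the same repo as the prompt. Everything below is illustrative, not prescriptive: the classification task, the field names (`input`, `golden`), and the file path are all assumptions.

```python
import json

# A minimal eval-set shape: real inputs paired with human-approved golden
# outputs. The task and field names are illustrative, not a required schema.
EXAMPLES = [
    {"input": "Refund request: order arrived damaged.", "golden": "refund"},
    {"input": "Where is my package? It's been 3 days.", "golden": "shipping"},
]

def save_eval_set(path, examples):
    """Store the set as JSONL so it lives in version control with the prompt."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

def load_eval_set(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

save_eval_set("evals.jsonl", EXAMPLES)
assert load_eval_set("evals.jsonl") == EXAMPLES
```

JSONL rather than one big JSON array means diffs stay line-per-example, which matters once the set starts accumulating edge cases from production.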

2. Score your current prompt against the set

This is the most morally clarifying number in the project. 72.4% on a 200-example reasoning set is a different problem than 91.6% on the same set. The first asks “how do I make the model do this at all”; the second asks “how do I make the model fail less weirdly on edge cases.”
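Getting that baseline number can be a very small loop. In this sketch, `run_model` is a hypothetical stand-in for whatever client call your feature actually makes, and exact-match scoring is the simplest option; real tasks often need a fuzzier scorer or an LLM-as-judge.

```python
def exact_match(prediction, golden):
    """Simplest possible scorer; many tasks need fuzzier comparisons."""
    return prediction.strip().lower() == golden.strip().lower()

def score_prompt(run_model, prompt, examples, scorer=exact_match):
    """Accuracy of one prompt over the eval set.

    `run_model(prompt, text) -> str` is a placeholder for your actual
    model call; swap in a wrapper around your provider's client.
    """
    hits = sum(scorer(run_model(prompt, ex["input"]), ex["golden"])
               for ex in examples)
    return hits / len(examples)

# Demo with a fake model that always predicts "refund":
fake_model = lambda prompt, text: "refund"
examples = [
    {"input": "order arrived damaged", "golden": "refund"},
    {"input": "where is my package", "golden": "shipping"},
]
print(score_prompt(fake_model, "Classify: {input}", examples))  # 0.5
```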

3. Try eight rewrites, not one

The single most reliable improvement I’ve made on prompts is to stop writing one rewrite and start writing eight. Different strategies: chain-of-thought, role priming, format scaffolding, task decomposition, few-shot examples, negative examples, structured output, and a control. Score all eight. Pick the winner.

The winner is rarely the one I would have shipped if I’d only written one.
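The compare-and-pick step is mechanical once scoring exists. In this sketch the three prompt variants, the fake model, and the one-example set are all hypothetical stand-ins, standing in for the full eight strategies run against a real eval set.

```python
def pick_winner(run_model, prompts, examples):
    """Score every prompt variant on the same eval set; return the best."""
    def accuracy(prompt):
        hits = sum(run_model(prompt, ex["input"]).strip() == ex["golden"]
                   for ex in examples)
        return hits / len(examples)
    scores = {name: accuracy(p) for name, p in prompts.items()}
    return max(scores, key=scores.get), scores

# Hypothetical variants (in practice, write all eight strategies):
prompts = {
    "control": "Classify this message: {input}",
    "role": "You are a triage specialist. Classify: {input}",
    "cot": "Think step by step, then classify: {input}",
}

# Fake model for the demo: only the chain-of-thought prompt "works".
fake_model = lambda prompt, text: "refund" if "step by step" in prompt else "?"
examples = [{"input": "order arrived damaged", "golden": "refund"}]

winner, scores = pick_winner(fake_model, prompts, examples)
print(winner)  # cot
```

Keeping the per-strategy scores around, not just the winner, is worth it: the losers tell you which failure modes each strategy does and doesn't fix.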

4. Ship the winner with the eval baked in

The eval set you built in step one becomes a regression suite. It runs in CI on every prompt change. It runs nightly against the live model in case the provider rolls a silent update. (They do. Constantly.) Without this you are flying blind, and the day will come when you regress 8% overnight and don’t notice for three weeks.
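The CI gate itself can be a few lines. The `BASELINE` and noise-margin values below are made up for illustration; you'd replace them with the score recorded when the winning prompt shipped and a margin calibrated to your eval's observed variance.

```python
# A regression gate for CI and a nightly cron. BASELINE is the score
# recorded at ship time; the margin absorbs normal eval noise.
BASELINE = 0.85
NOISE_MARGIN = 0.03

def check_regression(current_score, baseline=BASELINE, margin=NOISE_MARGIN):
    """Fail the build loudly instead of regressing silently."""
    if current_score < baseline - margin:
        raise AssertionError(
            f"Prompt regressed: {current_score:.1%} vs baseline {baseline:.1%}"
        )
    return True

# In CI this wraps a real eval run; here, a dummy score:
assert check_regression(0.86)  # within noise, passes
# check_regression(0.70) would raise and fail the build
```

The nightly run against the live model is the part people skip, and it's the one that catches silent provider updates.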

The 35% number

That’s the average accuracy lift across 14 tasks I tested while building Promptomize, a supervised model that automates this loop. Reasoning tasks averaged +28%. Structured extraction averaged +58%. Creative writing averaged closer to +9% — but creative writing eval is noisier than the others, so I’d treat that one with skepticism.

The point isn’t the specific number. The point is that the gap is consistently double-digit. If you have an LLM feature in production that hasn’t been through this loop, you almost certainly have a double-digit accuracy lift available to you, for the cost of a long afternoon.

What this isn’t

This isn’t about clever prompt-engineering tricks. The biggest wins I see come from structure — clear task definition, explicit output format, careful few-shot selection — not from magic phrases like “take a deep breath” or “you are a Stanford professor.” Those work sometimes. They work less than the YouTube videos suggest.

This also isn’t about switching models. Switching from a smaller to a larger model improves things, sure. So does fixing the prompt. The latter costs a long afternoon and doesn’t increase your inference bill 8x.

The takeaway

If you have an LLM feature in production and you can’t name the score it gets on an eval set you control, you have homework to do. The good news is the homework is small, the tooling is mature in 2026, and the upside is large.

The bad news is you’ve probably been shipping this feature for over a year and didn’t know it was bleeding 35% the whole time.
