The problem
Most teams using an LLM in production are leaving accuracy on the table because the prompts they ship were written by a human, once, in a rush. The model is capable of more — but the prompt is the ceiling. Existing tooling helps you track prompts. Almost nothing helps you improve them.
The product
Promptomize is a supervised model trained on 18,000 (input, weak prompt, strong prompt, eval-score) tuples. Given a frozen base prompt and a small held-out evaluation set, it rewrites the prompt and reports the projected lift before you ever push it to production.
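To make the training data concrete, here is a minimal sketch of what one (input, weak prompt, strong prompt, eval-score) tuple could look like. The class and field names are illustrative assumptions, not Promptomize's actual schema.

```python
from dataclasses import dataclass

# Illustrative shape of one supervision tuple; names are assumptions,
# not the real Promptomize schema.
@dataclass(frozen=True)
class RewriteTuple:
    task_input: str      # an example input for the task
    weak_prompt: str     # the human-written baseline prompt
    strong_prompt: str   # the improved rewrite the model learns to produce
    eval_score: float    # held-out score achieved by strong_prompt

ex = RewriteTuple(
    task_input="Extract the total from this invoice: ...",
    weak_prompt="Get the total.",
    strong_prompt="You are a financial analyst. Return only the invoice "
                  "total as a JSON number.",
    eval_score=0.91,
)
```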
How it works
- Ingestion. You provide a base prompt and 20–200 evaluation examples (input + golden output).
- Rewriting. The Promptomize model produces 8 candidate rewrites with diverse strategies — chain-of-thought, role priming, format scaffolding, decomposition.
- Evaluation. Each candidate is scored against your held-out set on the target backend (GPT, Claude, Gemini, open-weight).
- Selection. The best variant is returned with a confidence interval and a delta against your baseline.
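The selection step above can be sketched end to end. The snippet below scores candidates against a held-out set, picks the best one, and reports a bootstrap confidence interval plus the delta against baseline. The per-example correctness arrays are toy stand-ins for real backend calls, and the strategy names are taken from the list above; everything else is an assumption about how such a selector might be built.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """95% bootstrap CI on mean accuracy over per-example 0/1 scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return means[int(alpha / 2 * n_resamples)], means[int((1 - alpha / 2) * n_resamples) - 1]

# Per-example correctness (1 = matches golden output) on the held-out set.
# A real system would obtain these by calling the target backend; these
# values are illustrative.
baseline = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 4                     # 50% accuracy
candidates = {
    "chain_of_thought": [1, 1, 1, 0, 1, 0, 1, 1, 1, 0] * 4,       # 70%
    "role_priming":     [1, 0, 1, 1, 1, 0, 1, 0, 1, 0] * 4,       # 60%
}

best_name, best_scores = max(candidates.items(), key=lambda kv: sum(kv[1]))
delta = sum(best_scores) / len(best_scores) - sum(baseline) / len(baseline)
lo, hi = bootstrap_ci(best_scores)
print(best_name, round(delta, 2), (round(lo, 2), round(hi, 2)))
```

Resampling the per-example scores (rather than reporting a single mean) is what lets a small 20–200 example eval set yield an honest interval instead of a point estimate.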
Results
Across our internal benchmark suite — 14 tasks spanning reasoning, code generation, classification, and structured extraction — Promptomize lifts accuracy by an average of 35% versus the human baseline. The largest wins are on structured-extraction tasks (+58% F1 on financial-table extraction). The smallest wins are on creative writing, where evaluation is noisier.
Stack
- Training: PyTorch, Lightning, DeepSpeed on 8×A100.
- Backend: Modal for GPU rewrite endpoints, Postgres for tuples & evals, Redis for the rewrite cache.
- Frontend: Next.js 15 App Router, React Server Components, Tailwind, shadcn/ui.
- Eval harness: a custom multi-backend runner with deterministic sampling and per-task scoring.
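A multi-backend runner of the kind described in the last bullet can be reduced to a small interface: every backend is a callable from (prompt, input) to output, and every task carries its own scorer. The sketch below is an assumption about the shape of such a harness, with a toy deterministic backend standing in for GPT/Claude/Gemini clients (which would enforce determinism via temperature 0 or a fixed seed behind the same interface).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    examples: list[tuple[str, str]]       # (input, golden output) pairs
    scorer: Callable[[str, str], float]   # (model output, golden) -> score

def exact_match(output: str, golden: str) -> float:
    return 1.0 if output.strip() == golden.strip() else 0.0

def run_task(backend: Callable[[str, str], str], prompt: str, task: Task) -> float:
    """Average per-task score of `prompt` under `backend`.
    Determinism is the backend's responsibility (temperature=0 / fixed seed)."""
    scores = [task.scorer(backend(prompt, x), y) for x, y in task.examples]
    return sum(scores) / len(scores)

# Toy deterministic backend: uppercases the input, ignoring the prompt.
def toy_backend(prompt: str, x: str) -> str:
    return x.upper()

task = Task("upper", [("ab", "AB"), ("cd", "CD"), ("ef", "fe")], exact_match)
print(run_task(toy_backend, "Uppercase the input.", task))  # → 0.6666666666666666
```

Keeping the backend behind a plain callable is what makes the same harness score a candidate prompt on GPT, Claude, Gemini, or an open-weight model without changing the scoring code.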
Status
Promptomize is in private beta with 14 paying teams. The hosted product runs the full rewrite-and-eval loop; a self-hosted CLI is on the way for teams that need to keep prompts and evals on-prem. Expected GA: Q3 2026.
