LLMs + Verification = Magic
I had a magical experience at work today!
My team inherited a large AI system that wasn’t performing well. The router, responsible for directing questions to the correct sources, had a massive prompt (thousands of lines) with multiple responsibilities, tightly coupled to gpt-4o. We tried to make surgical improvements to the prompt but didn’t get anywhere, and swapping in newer models made the system perform worse. So we took a step back, spent about a week decomposing the prompt into multiple prompts, each with a single responsibility, and built real evals so we had a way to verify any change.
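A routing eval doesn’t have to be fancy: at its core it’s a set of labeled question→destination pairs scored for accuracy. Here’s a minimal sketch of the idea; the `route` function and the cases are hypothetical stand-ins (the real router calls a model with the routing prompt), not our actual system:

```python
# Minimal routing-eval sketch. `route` is a hypothetical stand-in for
# the real LLM-backed router; the cases are illustrative, not real data.

EVAL_CASES = [
    ("How do I reset my password?", "account"),
    ("What's the refund policy?", "billing"),
    ("The app crashes on launch", "support"),
]

def route(question: str) -> str:
    # Placeholder: the real implementation would call the routing prompt/model.
    keywords = {"password": "account", "refund": "billing", "crash": "support"}
    for kw, dest in keywords.items():
        if kw in question.lower():
            return dest
    return "fallback"

def routing_accuracy(cases) -> float:
    # Fraction of questions sent to the expected destination.
    correct = sum(1 for question, expected in cases if route(question) == expected)
    return correct / len(cases)

if __name__ == "__main__":
    print(f"routing accuracy: {routing_accuracy(EVAL_CASES):.1%}")
```

The point is the interface, not the implementation: a single number that any change (human- or agent-made) can be checked against.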
With that setup in place, I was able to give Claude Code a goal (e.g. iterate on the prompt until routing accuracy is > 98%, as verified by the eval suite) and let it work. After about an hour it reached 99% accuracy and removed hundreds of lines of prompt code. It just… worked 🤯.
I’ve understood and seen the value of giving Claude Code a way to verify its work, but this experience felt different. Making our eval suite fast (~11 minutes for the full suite) was the real game changer. Our first pass took ~4 hours to run, which was too slow to unlock any value.
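One common way to get that kind of speedup is to run eval cases concurrently, since each case is mostly waiting on a model call. A sketch of the idea, with a hypothetical `route` standing in for the I/O-bound model call:

```python
# Concurrency sketch for speeding up an eval suite. `route` is a
# hypothetical stand-in for an I/O-bound model call; cases are illustrative.
from concurrent.futures import ThreadPoolExecutor

def route(question: str) -> str:
    # Placeholder for a network-bound LLM call.
    return "billing" if "refund" in question.lower() else "support"

CASES = [
    ("What's the refund policy?", "billing"),
    ("The app won't start", "support"),
] * 50

def run_suite(cases, workers: int = 16) -> float:
    # Threads overlap the waiting time of concurrent model calls,
    # so wall-clock time shrinks roughly with the worker count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(lambda c: route(c[0]) == c[1], cases))
    return hits / len(cases)

if __name__ == "__main__":
    print(f"accuracy: {run_suite(CASES):.1%}")
```

With real model calls you’d also want retries and rate-limit handling, but the shape is the same: make the feedback loop fast enough that an agent can run it many times per hour.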