While building Conductor, we’ve often struggled to be sure we haven’t introduced regressions when changing system prompts and similar pieces. Manual testing works up to a point, but we rarely have time to go through the entire feature set. We have integration tests, but they don’t cover the entire feature set either.

Enter DeepEval. We’ve only recently started adopting it, but it’s already clear where this is headed. DeepEval lets you “unit test” features of LLM-enabled software. It’s open source and comes with a lot of useful tools built in. Getting started is a little finicky, but the learning curve isn’t steep.
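
To make the “unit test” idea concrete, here is a minimal sketch in plain Python. It does not use DeepEval’s API. Every name in it (`PromptCase`, `coverage_score`, `check_case`, `fake_generate`) is hypothetical, and so is the password-reset scenario. The shape is the point: run the feature on a fixed input, score the reply with a metric, and fail the test if the score drops below a threshold. The naive phrase-matching metric here only stands in for the more capable metrics a tool like DeepEval ships with.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PromptCase:
    """One regression case: a user input and phrases the reply should contain."""
    user_input: str
    required_phrases: List[str]


def coverage_score(reply: str, required_phrases: List[str]) -> float:
    """Return the fraction of required phrases found in the reply (case-insensitive)."""
    if not required_phrases:
        return 1.0
    text = reply.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return hits / len(required_phrases)


def check_case(
    generate: Callable[[str], str],
    case: PromptCase,
    threshold: float = 0.7,
) -> bool:
    """Run the feature on the case input and compare its score to the threshold."""
    reply = generate(case.user_input)
    return coverage_score(reply, case.required_phrases) >= threshold


def fake_generate(user_input: str) -> str:
    """Stand-in for the real LLM call (hypothetical); returns a canned reply."""
    return "You can reset your password from the Settings page."


def test_password_reset_reply() -> None:
    """A pytest-style test: fails if the reply stops covering the key phrases."""
    case = PromptCase(
        user_input="How do I reset my password?",
        required_phrases=["reset", "password", "settings"],
    )
    assert check_case(fake_generate, case)
```

In a real suite, `fake_generate` would be replaced by the call that runs the feature with the current system prompt, so a prompt change that breaks the behavior shows up as a failing test rather than a surprise in production.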