Is your LLM system a tool or toy?
TL;DR: If you can’t properly eval your change to a prompt, you are using a toy, not a tool.
Danger: strong opinions ahead. Mostly relevant to e.g. programming or chat / command line tool environments. YMMV. If something works for you, keep using it :)
Customizable LLMs are everywhere now. There are lots of systems where you can add a “system prompt”: a “role”, an “AGENT.md”, or something like that.
I’m just going to come out and say it:
MOST LLM-BASED SYSTEMS RIGHT NOW ARE TOYS, NOT TOOLS
The distinction between a tool and a toy, to me, is the following: when you want to change your “system prompt” in some way,
- do you have a way to repeatably run old inputs and see what the actual effects of the change are?
- is it easy to mechanically switch prompts between old and new versions and back?
- do you have a way to easily run multiple rollouts of a single input and see how many succeed or fail by some criterion?
If the answer to any of these questions is no, you’re not using a tool, you are using a toy. A telltale sign: the only way to run your system+prompt automatically on a bunch of inputs is by clicking around furiously. (Even in that case, you could build an eval harness around it yourself, but that’s a nontrivial amount of work.)
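Such a harness doesn’t have to be elaborate. Here’s a minimal sketch of what the three checks above amount to: versioned prompts, a frozen set of old inputs with mechanical pass/fail criteria, and repeated rollouts per input. `call_llm` is a hypothetical stand-in for your actual system (an API call, a CLI invocation, whatever), and the prompts and eval case are made-up examples.

```python
import random

# Hypothetical stand-in for your actual LLM system (API, CLI, editor plugin).
# Deliberately stochastic to mimic real rollout variance; string seeding makes
# each (prompt, input, seed) triple repeatable across runs.
def call_llm(system_prompt: str, user_input: str, seed: int) -> str:
    rng = random.Random(f"{system_prompt}|{user_input}|{seed}")
    return "42" if rng.random() < 0.8 else "I'm not sure"

# Versioned prompts: switching between old and new is a dict lookup,
# not a frantic copy-paste.
PROMPTS = {
    "v1": "You are a terse calculator. Answer with the number only.",
    "v2": "You are a helpful calculator. Answer with the number only.",
}

# Frozen old inputs, each paired with a mechanical pass/fail criterion.
EVAL_SET = [
    ("What is 6 * 7?", lambda out: out.strip() == "42"),
]

def run_eval(prompt_version: str, rollouts: int = 10) -> float:
    """Run every stored input `rollouts` times; return the overall pass rate."""
    prompt = PROMPTS[prompt_version]
    passes, total = 0, 0
    for user_input, check in EVAL_SET:
        for seed in range(rollouts):
            total += 1
            passes += check(call_llm(prompt, user_input, seed))
    return passes / total

if __name__ == "__main__":
    for version in PROMPTS:
        print(f"{version}: {run_eval(version):.0%} pass")
```

Twenty lines of glue like this is the whole difference: now a prompt change produces a number, not a vibe.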
Don’t get me wrong, toys certainly have their place, people learn best through play. But for real, professional use, you will want to start tuning your entire system, including your prompt.
And because the LLMs are complex, stochastic systems, it is not straightforward to improve them. The first steps are easy but once you are improving your system by 2% steps, it’s really difficult to see whether progress in one area leads to regressions in another. OTOH, in real, professional use, 2% steps start mattering quickly even if you only make one every week.
For optimizing other complex, stochastic systems, the frontend industry eventually figured out randomized A/B testing and why it’s king (just as it has been in medicine all along). That’s also how the LLMs themselves are tested: large benchmarks measuring how well they do.
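The A/B comparison itself is standard statistics, nothing LLM-specific. As an illustration (a sketch using a textbook two-proportion z-test, not anything prescribed by the post), here’s how you’d check whether a 6-point pass-rate gap between two prompt versions is likely real or just noise; the rollout counts are made-up numbers:

```python
from math import erf, sqrt

def two_proportion_pvalue(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Two-sided p-value for 'prompt A and prompt B have the same pass rate'."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)  # pass rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(p_a - p_b) / se
    # Normal CDF via the error function; doubled for a two-sided test.
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

# 500 rollouts per arm: 80% vs 86% pass rate.
print(two_proportion_pvalue(400, 500, 430, 500))  # ~0.01, unlikely to be noise
```

This is also why 2% steps demand real rollout volume: with only 50 rollouts per arm, the same 6-point gap would not reach significance.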
This is not the only thing that separates tools from toys: being model-agnostic, allowing comparison between models just like between prompt versions above, allowing automatic switching between models to save cost, etc etc etc.
I’ve been looking for the optimal system to use LLMs to do programming for a while now. Zed has been interesting but there’s still something about the whole process that has seemed off. Realizing the above has opened my eyes to what I actually want to see; I want a real tool, not a toy.