PSA - Sandbox your LLMs.

TL;DR: No real harm done, but a minor yikes: my first encounter with an LLM agent acting outside the boundaries I thought it had (wishful thinking, it turns out).

If you’re like me, you haven’t really taken the warnings about LLMs seriously yet. “I’m watching it all the time”, “It only happens in weird contrived scenarios”, …

But not me after today.

I was using Claude 4 Sonnet in thinking mode through Zed, working on a project that had a local Python dependency, ../linopx. I was watching Claude try to use it without knowing the API (I hadn’t yet provided it as context).

All of a sudden, BOOM:

See that? Just like that, Claude stepped outside the boundaries I had imagined for it. Instead of playing inside the project directory, it accessed ../linopx to figure out how to use the package.

My heart skipped a beat. I hadn’t realized this was possible (even though I had read about Claude sending emails to the authorities).

I know, this is not a huge violation per se. It’s not like I wasn’t going to give it access to that source code myself. And it was trying to accomplish the goal I gave it.

But somehow, it is still way more than I would have expected from inside an editor (in Zed, I can’t manually add ../linopx/file as context for the LLM, only files inside the root directory…).

Suddenly, the whole alignment and paperclip maximizers discussion seems a lot more concrete.

So, from now on, I’m going to conscientiously sandbox my LLMs before giving them shell access. You should too.
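To make the "boundary" concrete, here is a minimal Python sketch of the kind of check I had wrongly assumed was happening: does a requested path, once resolved, still live under the project root? The function name is_inside_root and the example paths are hypothetical, made up for illustration. This is not how Zed or Claude handle paths, and it is not a sandbox by itself; an agent with shell access can bypass any check like this, which is why real isolation has to happen at the OS level (container, VM, separate user).

```python
from pathlib import Path


def is_inside_root(root: str, candidate: str) -> bool:
    """Return True if `candidate`, resolved against `root`, stays under `root`.

    Hypothetical helper for illustration only; not part of any editor or agent API.
    """
    root_path = Path(root).resolve()
    # Join first, then resolve, so ".." segments and symlinks are collapsed
    # before the comparison. An absolute `candidate` replaces the root entirely.
    target = (root_path / candidate).resolve()
    return target == root_path or root_path in target.parents


project_root = "/home/me/project"  # hypothetical project location

print(is_inside_root(project_root, "src/main.py"))       # True
print(is_inside_root(project_root, "../linopx/api.py"))  # False: escapes the root
print(is_inside_root(project_root, "/etc/passwd"))       # False: absolute path
```

A check like this only helps if every file access goes through it. Once the agent can run shell commands, nothing forces it to, so treat it as a picture of the boundary, not as the fence.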

(The really scary part is: is that even remotely enough? If it’s able to do this now, in a couple of years it could easily look for security holes in the sandbox to accomplish the same thing — just because I told it to do something and didn’t provide enough context.)



