It’s table stakes now to give background coding agents a way to run Bash commands: OpenAI Codex does, Google Jules does, Devin does, and so on.
> There must be a standard implementation of a Bash tool by now?
Our friends at OpenAI and Anthropic have helpfully provided some reference implementations for Bash tools. Great! But wait! You may notice one key difference in these implementations - Anthropic’s example creates a persistent Bash session for the LLM to interact with, whereas OpenAI’s example runs an LLM-generated Bash command directly as a subprocess.
> Is one better than the other?
As it turns out, this is the big question! Let’s start with the subprocess implementation.
The `child_process` era
We mostly write TypeScript at Engine Labs - the equivalent of Python's `subprocess` module that the reference implementations use is Node's `child_process` module. Our earliest Bash tool effectively did what OpenAI's reference implementation does - it ran LLM-generated commands as child processes and returned command output.
```typescript
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

// error handling omitted for clarity
async function BashTool(command: string): Promise<string> {
  const { stdout } = await execAsync(command);
  return stdout;
}
```
This worked great for quite a while! Switching to Bun as our runtime gave us a slightly more ergonomic version of `child_process.exec` for free: Bun Shell.
```typescript
import { $ } from "bun";

// error handling omitted for clarity
async function BashTool(command: string): Promise<string> {
  const output = await $`${{ raw: command }}`.text();
  return output;
}
```
This lightweight implementation served us well until some edge cases started turning up.
User interaction
Some commands ask for user input while they're running. One workaround is to set environment variables to indicate that this isn't possible, e.g. `DEBIAN_FRONTEND=noninteractive`.
The agent can also be instructed to avoid commands that require user input, or to pass flags to disable user input where possible (e.g. `sudo apt-get install -y ...`).
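As a rough sketch of what this looks like in the subprocess version (reusing the `execAsync` helper from above; the exact set of overrides you need depends on your toolchain - `GIT_TERMINAL_PROMPT=0`, for instance, stops git from prompting for credentials):

```typescript
// Sketch: discourage interactive prompts via environment variable overrides.
// Assumes the execAsync helper from the child_process example above.
async function BashTool(command: string): Promise<string> {
  const { stdout } = await execAsync(command, {
    env: {
      ...process.env,
      DEBIAN_FRONTEND: "noninteractive", // apt-get and friends skip their prompts
      GIT_TERMINAL_PROMPT: "0", // git fails fast instead of asking for credentials
    },
  });
  return stdout;
}
```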
REPLs
If the agent wants to start a REPL with something like `python3` or `node`, this will not work at all, since there is no persistent Bash session. The only mitigation for this was to instruct our agent to write code to a file with its other tools and only run the file with the Bash tool.
Timeouts
Alongside REPLs, there are other Bash commands that do not terminate without further user input, so we had to add a timeout to our Bash tool to handle these cases.
> How long should that timeout be?
Ideally, we would have liked to support arbitrarily long-running commands like long compilation steps or `docker build`, but having a timeout meant that wasn't possible. So we picked something sensible like 5 minutes and prompted our agent about the timeout for its Bash tool.
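As a sketch, again reusing the `execAsync` helper (Node's `exec` accepts a `timeout` option that kills the child and rejects the promise once the timeout elapses; the 5-minute figure is just the one mentioned above):

```typescript
// Sketch: cap command runtime so non-terminating commands can't hang the agent loop.
const TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes

async function BashTool(command: string): Promise<string> {
  try {
    const { stdout } = await execAsync(command, { timeout: TIMEOUT_MS });
    return stdout;
  } catch (error) {
    // On timeout, Node sends SIGTERM to the child and the promise rejects.
    return `Command failed or timed out: ${error}`;
  }
}
```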
Switching to a persistent Bash session
One annoying problem with the workarounds was the non-determinism of Bash tool failures and timeouts: whether a given call succeeded depended on how well the agent followed instructions and how long the command took to run.
We eventually got tired of this unpredictability and the artificial restriction on the types of command that the agent was allowed to run, so we made the switch to a persistent Bash session.
After a brief but painful attempt to implement our own terminal, we settled on using `node-pty`, the pseudo-terminal library behind VS Code's terminal emulator.
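The core of the setup looks roughly like this (a minimal sketch - the terminal dimensions and the output handling are illustrative choices, not our exact implementation):

```typescript
import { spawn } from "node-pty";

// Spawn one long-lived Bash session that all agent commands are written to.
const shell = spawn("bash", [], {
  name: "xterm-color",
  cols: 120,
  rows: 30,
  cwd: process.cwd(),
  env: process.env as Record<string, string>,
});

// Accumulate everything the terminal prints, exactly as a user would see it.
let output = "";
shell.onData((data) => {
  output += data;
});

// Commands are sent as keystrokes, terminated by a carriage return.
function sendCommand(command: string): void {
  shell.write(`${command}\r`);
}
```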
However, we now had a new problem: how do we know if a command has finished running, or is waiting for user input? Unfortunately for us, this is pretty close to asking if we can solve the halting problem.
“Solving” the halting problem
Though it perhaps seems like we were doomed from the beginning, we didn’t set out to handle every possible input, just most of the ones that our agents might run during the course of creating a PR. There were a number of tricks and heuristics that we thought we might be able to use to approximate a solution.
Many implementations (including Anthropic’s reference one from above) rely solely on a timeout to decide when the Bash session is accepting further user input. This works for short-running commands, but is not ideal for longer-running commands, as the tool will either error, or only return command output prior to the timeout.
Anthropic's computer-use demo has a Bash tool implementation that offers one neat trick: it appends an `echo` command after each LLM-generated command that prints a sentinel value, which can then be searched for in the Bash tool output.
This works for commands that return to the Bash prompt, but not if the command starts a REPL.
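A sketch of the trick (the sentinel string is an illustrative choice; note that the terminal echoes back the command you type, sentinel included, so you want to match the sentinel on its own line rather than anywhere in the output):

```typescript
// Sketch: append an echo that only runs once Bash returns to its prompt.
const SENTINEL = "__BASH_TOOL_DONE__";

function wrapCommand(command: string): string {
  return `${command}; echo "${SENTINEL}"`;
}

function isFinished(output: string): boolean {
  // Match the sentinel on its own line so the echoed-back command
  // (which also contains the sentinel) doesn't trigger a false positive.
  return output.split("\n").some((line) => line.trim() === SENTINEL);
}
```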
A brief diversion into strace
We tried briefly to channel our inner Linux wizards and trace the system calls that the `node-pty` process was making, to check whether it was waiting for user input before we sent LLM-generated commands.
A combination of not being certain we'd caught all the ways that a process could be waiting for user input, and trouble running `sudo` in the `strace`-d terminal process, led us to abandon this method.
(We think that it's probably possible to implement a solid `strace`-based solution, but at the time it was causing more trouble than it was worth. Let us know if you reckon you can do it!)
Output stability
One of the core tricks we use to decide whether the Bash session is ready for more input is to watch for any change in output over a short window of time.
For example, short-running commands like `ls` will end quickly, returning command output and then displaying the Bash prompt. After some time, we deem the output to be "stable" and can return it to the agent.
```
engine@10:~/project$ ls
CONTRIBUTING.md Dockerfile LICENSE README.md package-lock.json package.json server.js tests
engine@10:~/project$
```
For a long-running command, e.g. starting the `python3` REPL, the Bash session will display the Python REPL prompt and nothing else. Again, after some time, the output is deemed "stable" and we can return it to the agent.
```
engine@10:~/project$ python3
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
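A sketch of the stability check, assuming the `shell` and accumulated `output` from the `node-pty` snippet above (the window and poll interval are illustrative):

```typescript
// Sketch: wait until the terminal output stops changing for a short window.
const STABILITY_WINDOW_MS = 500;
const POLL_INTERVAL_MS = 50;

async function waitForStableOutput(): Promise<string> {
  let lastOutput = output;
  let lastChange = Date.now();

  // Keep polling until no new bytes have arrived for the whole window.
  while (Date.now() - lastChange < STABILITY_WINDOW_MS) {
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
    if (output !== lastOutput) {
      lastOutput = output;
      lastChange = Date.now();
    }
  }
  return output;
}
```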
It may seem like that’s enough! But there are two more things to consider:
- How long exactly should we wait for output to stabilise?
- How much latency in the agent loop is tolerable?
For point 1 we need to consider what happens when a command deliberately has no output for a period of time, and is also long-running. For example, `docker build --quiet` suppresses output until completion, then prints the image ID - so it looks "stable" immediately but can actually run for a long time.
> Fine, let’s just set the output stability timeout to something large, like 10 minutes
Sure, that'll work, but that brings us to point 2: we'd then be waiting 10 minutes for every command that the agent wants to run, including things like `ls`.
> So what next? How often is this kind of command a problem?
It’s definitely not common, but we have seen cases where this has been a persistent issue - mostly related to build steps, linting and typechecking for large projects.
In general, we’ve found that the more robust a Bash tool is to agent inputs, the better the agent performs - so we use a couple more tricks to handle these kinds of edge cases.
> And these tricks are…?
That would be telling! We can say that there are no subagents or extra terminal tools involved. We're pretty pleased with the success rate of the final incarnation of our Bash tool - keep your eyes peeled for a submission to Terminal Bench!
Extra notes
Security
We haven’t mentioned sandboxing or security because our agents and their terminals run inside isolated Firecracker VMs, so there’s not a huge amount to worry about.
Control characters
We also didn't talk about how to handle control characters and arrow keys in the `node-pty` version of the Bash tool. We've done this for our Bash tool, but we leave it as an exercise to the reader. (It's simpler than it sounds - we're happy to compare notes if you're curious!)
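For a flavour of what's involved: control keys are just bytes and escape sequences written to the pty. A couple of well-known examples (the full mapping is the exercise):

```typescript
shell.write("\x03");   // Ctrl-C (ETX): interrupt the foreground process
shell.write("\x1b[A"); // Up arrow (ANSI escape sequence): recall previous history entry
```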