I was a bit confused as to how everything works until I read it in detail. Really cool tools, but I think one thing that would help in the introduction is: saying explicitly that the generated .md document is for you (the user) to read through, observe the output of the CLI call, and ensure that the output matches what you would expect.
It's basically an automated test, but at a higher abstraction level and with manual verification--using CLI tools rather than a test harness. Really great work!
I love your content, but I wish you'd make your blog theme responsive for wider screens/non-mobile. I prefer to read content like this on a large screen.
Showboat seems like it could actually be quite useful for humans too, just for making quick notes from a CLI without opening an editor. The "pop" command makes me wonder if there would be a benefit to also having an array-like in addition to the stack-like interface. It seems like it would be fairly trivial to generate an index of markdown blocks so that they could be edited individually.
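To make the array-like idea concrete, here is a minimal sketch of indexing the blocks of a Markdown document so that one of them can be replaced by position. It is not part of Showboat; `split_blocks` and `replace_block` are hypothetical names, and the rule that blocks are separated by blank lines (with fenced code kept whole) is an assumption.

```python
FENCE = "`" * 3  # three backticks, the usual Markdown code fence


def split_blocks(markdown: str) -> list[str]:
    """Split a Markdown document into blocks separated by blank lines.

    A fenced code block stays together as one block even if it
    contains blank lines.
    """
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            # A blank line outside a fence ends the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def replace_block(markdown: str, index: int, new_text: str) -> str:
    """Return the document with the block at `index` replaced."""
    blocks = split_blocks(markdown)
    blocks[index] = new_text
    return "\n\n".join(blocks) + "\n"
```

With an index like this, "pop" is just the special case of removing the last block, and editing an earlier note becomes a call such as `replace_block(text, 1, "Edited note.")`.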
I like the idea of Rodney, but I wonder if you might actually have better results by asking the agent to generate equivalent Selenium scripts instead. I'm specifically suggesting Selenium because it's been around so long so I assume there's a lot of Selenium in the LLMs training data, but there are other options that might work too.
First time someone's asked for the site to be wider! I have it set up so that on a wide screen the text is still a readable width. Do you think the max width needs bumping up a bit more?
I've found the models are so good at Playwright that I don't consider Selenium any more. Rodney is my first experiment not using Playwright.
I second the request to make the site responsive. When I load the page the CSS constrains the main content to 560px and the whole page is constrained to 940px. Here's how it displays on my system:
Your tastes are your own, and there is an argument for just filling the window, but you won’t find a typographic authority that advocates setting body text much wider than that (and I would agree with them).
Passing tests in your repo are great documentation of the tool at a microscopic level. And rerunning tests only burns tokens on failures (since passed tests just print a dot) so it’s token efficient too.
Some other neat tricks:
- For greater efficiency configure your test runner to print nothing (not even a dot/filename) for test successes. Agents don’t need progress dots, only the exit code & failure details
- Have your agent implement a 10ms timeout per test. pytest has hooks to do this. The agent will see tests time out and mock out all I/O and third party code - why test what one assumes third parties tested already! Your test suite is CPU-bound without a shared database, has no shared data and no tests that interfere with or depend on each other, so tests can run in parallel.
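A rough sketch of the time-limit idea, using only the standard library rather than pytest's hooks. `run_with_budget` is a hypothetical helper, and it checks elapsed time after the test returns, so it flags slow tests but does not interrupt one that hangs.

```python
import time


def run_with_budget(test_fn, budget_s=0.010):
    """Call test_fn and raise TimeoutError if it ran longer than budget_s.

    The check happens after test_fn returns, so this reports slow
    tests; it cannot stop a test that never finishes.
    """
    start = time.perf_counter()
    result = test_fn()
    elapsed = time.perf_counter() - start
    if elapsed > budget_s:
        name = getattr(test_fn, "__name__", repr(test_fn))
        raise TimeoutError(
            f"{name} took {elapsed * 1000:.1f}ms, "
            f"budget is {budget_s * 1000:.0f}ms"
        )
    return result
```

A test that sleeps or touches the network will blow a 10ms budget straight away, which is the signal that pushes the agent towards mocking out the I/O.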
I'm OK with longer running tests because I always have them run against a real database (often SQLite, sometimes PostgreSQL) and real files created in temporary directories, but I can see how the time limit might be useful for tests that don't need those kinds of components.
I'll be sure to try these out. I've been building my own alternative to Beads with a concept called "gates" which do not let you close tasks as complete until a gate passes. Would love to throw these in as "gates" for my current workflow.
Out of curiosity, what is the advantage of using Rodney when Playwright has the same set of features and AI understands how to write a Playwright script very well?
Showboat documents look neater if there are single one-line commands that do something useful. Dumping a full Playwright script into a cell is less readable.
Showboat also has a special feature where you can embed an image directly in the document by running:
showboat image doc.md 'rodney screenshot'
The command you call should return a path to an image file as the last line of output. Rodney does exactly that.
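The contract described above (the last line of output is a path to an image file) can be illustrated with a short Python sketch. This is not Showboat's implementation; `image_path_from_command` is a hypothetical helper, and skipping blank trailing lines is an assumption.

```python
import subprocess
import sys


def image_path_from_command(argv):
    """Run argv and return the last non-empty line of its stdout."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    lines = [line for line in completed.stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("command printed nothing, so there is no path to read")
    return lines[-1].strip()


# Stand-in for a screenshot command: prints a log line, then a path.
fake_screenshot = [sys.executable, "-c", "print('saving...'); print('shot.png')"]
print(image_path_from_command(fake_screenshot))
```

Any command that follows the same convention could be used in place of the stand-in, which is why the two tools do not need to know about each other.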
It may well turn out that Rodney is unnecessary and people find better patterns using Showboat with existing tools like playwright-cli - in which case it won't matter because Showboat and Rodney aren't coupled to each other at all.
Showboat is definitely the more significant of the two projects.
Showboat does look neat -- though I feel like I don't fully grasp the use case yet! Maybe I just need to use more AI :)
As for rodney, I was thinking that although Playwright is good, there are a lot of cases where AI can't really understand things that a human would grasp instantly. For instance, often a broken e2e test will get stuck at some point. AI never seems to grasp this - it always thinks it's a timeout and goes down latency rabbitholes. If there was some way to give AI a log of "this line was reached at this point", you could really improve the state of the world :)
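One way such a "this line was reached" log might look, as a minimal standard-library sketch. `checkpoint` and `last_reached` are hypothetical names, not part of Playwright or Rodney.

```python
import inspect
import time

CHECKPOINTS = []  # entries are (seconds since start, "file:line", label)
_START = time.monotonic()


def checkpoint(label):
    """Record that the calling line was reached, with the elapsed time."""
    caller = inspect.stack()[1]
    entry = (
        round(time.monotonic() - _START, 3),
        f"{caller.filename}:{caller.lineno}",
        label,
    )
    CHECKPOINTS.append(entry)
    return entry


def last_reached():
    """Return the most recent checkpoint, or None if nothing was reached."""
    return CHECKPOINTS[-1] if CHECKPOINTS else None
```

Sprinkling `checkpoint("clicked submit")` through an e2e test and dumping `CHECKPOINTS` on failure would show where the test got stuck, rather than only that it ran out of time.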
Very interesting! I encountered the problems these tools are trying to tackle just recently while trying to guide an agent into creating an in-browser tool for me. Closing the loop on a web interface isn't as simple as CLI-only tools. I should give this a try.
It's also interesting that you've shifted to Go for your agent-coded CLI tools, Simon.
I'm dabbling with Go at the moment for small tools, mainly as an excuse to learn a new language but also because having a single standalone binary is convenient for shuttling these tiny little tools around.
... but then I'm mostly running them with "uvx name-of-tool" because it turns out Python's packaging infrastructure for binary tools is so good!
Right, standalone binaries for CLI tools are great. And if one has Go installed, they can just `go run ...` any tool from its GitHub path; all installation/build/caching happens automagically (meaning the execution is immediate after the first run).
But I can definitely see how someone with `uv` muscle memory wants everything in the same command.
`uv` is the best thing that happened to the Python ecosystem since... I don't know... maybe Numpy.
If you're coming from the Python world, definitely. I find `go install github.com/simonw/rodney@latest` equally easy. :D Although you need the Go tooling installed, of course. But very much agreed, Go is great for CLIs!
Yes, very much so. It's a much thinner, less feature-rich alternative.
It would be interesting to experiment with Jupyter notebooks as an alternative that could work in Claude Code for web.
I had a poke around just now and couldn't find an existing CLI tool that lets you build those up a section at a time in the same way as Showboat. I did find this Python library though:
uv run --with nbformat python -c '
import nbformat
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# NBTerm Exploration"))
nb.cells.append(nbformat.v4.new_code_cell("import sys\nprint(f\"Python {sys.version}\")"))
nb.cells.append(nbformat.v4.new_code_cell("x = [i**2 for i in range(10)]\nprint(x)"))
nb.cells.append(nbformat.v4.new_code_cell("sum(x)"))
with open("demo.ipynb", "w") as f:
    nbformat.write(nb, f)
'
So you could tell the agent to run code like that and then inspect the `demo.ipynb` notebook later on. It doesn't show the result of evaluating the cells though, you need to run this afterwards to have that happen:
uv run --with nbformat --with nbclient --with ipykernel python -c '
import nbformat
from nbclient import NotebookClient
nb = nbformat.read("demo.ipynb", as_version=4)
client = NotebookClient(nb, timeout=60)
client.execute()
nbformat.write(nb, "demo_executed.ipynb")
'
Cool, I have to say I find the idea intriguing as a traceability tool, in the way that LLMs can show you step by step how a program is assembled / an output was generated.
I think it's more about the interface than the output. The agent can add stuff to a markdown file with simple cli commands rather than a more complex editor or file interface.
Yeah it's pretty new indeed. It's very effective at doing pretty much any browser automation task and I have to say that using it with the included skill is pretty seamless.
Wait, why shouldn't an LLM just write directly to the markdown file instead of going through the extra step of using a CLI tool which is basically `echo 'something' >> file.md` but with templates that should really be in a prompt instead of being in a compiled binary? Did Claude come up with the idea for this as well?
Also, I am sure you must already know about Playwright mcp so why this? If your goal isn't to make the cli human-friendly, which is the only advantage clis have over mcps doing the same thing, then why not just use the mcp? It doesn't even handle multiple sessions and has a single global state file––this is slop.
Because I don't want it to write to the markdown file directly. I want it to tell me the command it runs and I then run that command and write both the command and the output to the file.
Otherwise it's just writing a document, not building a demo you can review.
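The difference can be sketched in a few lines of Python: the tool, not the model, runs the command and writes down what actually came back. This is an illustration only. `record_command` is a hypothetical helper and the fenced layout it writes is an assumption, not Showboat's real document format.

```python
import subprocess
from pathlib import Path

FENCE = "`" * 3  # three backticks, the usual Markdown code fence


def record_command(doc_path, command):
    """Run a shell command and append it, plus its real output, to a file."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True
    )
    output = completed.stdout + completed.stderr
    entry = (
        f"{FENCE}bash\n{command}\n{FENCE}\n\n"
        f"{FENCE}output\n{output.rstrip()}\n{FENCE}\n\n"
    )
    # Append rather than overwrite, so the document grows one step at a time.
    with Path(doc_path).open("a", encoding="utf-8") as doc:
        doc.write(entry)
    return completed.returncode
```

Because the output is captured by the helper, the agent can choose which command to run but cannot make up what the command printed.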
As far as I can tell you can't hook MCPs up to Claude Code for web.
I originally planned to support separate sessions but decided to leave that out for the initial release. I've opened an issue for that here: https://github.com/simonw/rodney/issues/6
Sounds like both of these tools could be one shot by either Claude or Codex.
Or alternatively, just be a skill versus a tool.
My “agents” already demo stuff all the time by just being prompted to do so. I have notations in my standard Agents.md for how I want my documentation, testing etc.
I guess it would still make sense to have "demo" and "browser-use" skills, so that the agent can reach for them proactively? I always try to remove as much friction as possible for myself, one little bit at a time.
My problem is that I work in dozens of different repos generally using Claude Code for web, which doesn't have a way to install extra global skills yet.
I don't want to duplicate my skills into all those repos (and keep them updated) so I prefer the "uvx tool --help" pattern.
That's actually one of the things that has kept me from using Claude Code web (that, and I often need a Chrome browser for the agent). But they must be working on it.
I saw an MCP I've set up on claude.ai show up in my local Claude Code MCP list the other day, it seems inevitable that there will be skills integration across environments as well at some point.
In working on Rodney I found out that the Claude Code for web environment has a Chrome browser installed already. It's a shame you can't see its output directly - even if it takes a screenshot there's no easy way to view it other than having it commit and push that to a branch in GitHub.
This comment is regurgitating Simon's post with too much adherence to the input tokens. The unnatural, promotional restating of proper nouns in constrained output is a notable LLM tell.
If you could actually detect AI content with high accuracy, you would sell it as a service and print money, but you can't, so you force all the rest of us to wade through posts like yours, claiming to tell the rest of us what is and isn't AI, which are FAR more annoying, disruptive, and low signal than the post you're commenting on, which is intelligent, adds to the conversation, and is, by my read, almost certainly actually human authored, just written by someone who knows how to write.
Human heuristics - I've prompted millions of tokens across every frontier model iteration for all manner of writing styles and purposes - also help greatly.
Concerning to me are long-time posters who (perhaps unknowingly) advance the decline of this human community by encouraging the people breaking HN guidelines. Perhaps spending a few hours on Moltbook might help develop such a heuristic, since "someone who knows how to write" is just a Claude model with a link to the blogpost.
If agents can generate text so easily, why would they be limited to Markdown instead of reStructuredText, AsciiDoc, or LaTeX, which have rich features that help users understand text? I can understand developers refusing to adopt proper formats for documentation, but this seems odd for the bots. It doesn’t even generate the correct syntax block in Markdown, using “bash” instead of “sh-session”.
MS GitHub can not only render rST & AsciiDoc (albeit pretty poorly, since the CSS is bad), but it also employs its own fork of Markdown that isn’t compatible with other forks. “Just Markdown”, like the base spec, is so feature-poor that everyone has their own incompatible fork, so documents don’t render properly from one to the next.
console/shell-session/sh-session (parsers call them different names) are for shell sessions, terminal sessions. Bash syntax is for Bash scripts, & these aren’t scripts but something you run in your terminal session.
I dunno. I’ve written a bit of LaTeX but does it really shine in this context? IMO the real advantage it has is that it can allow the user to express more complicated intents than Markdown (weird phrasing—my natural instinct was to call LaTeX more precise than Markdown, but Markdown is pretty precise for describing the type of file that it is good at…).
Anyway LLMs don’t have underlying intent so maybe it is fine to just let them express what they can in Markdown?
I think it's primarily because that is the most common formatting in every editor now? I could be wrong. Markdown has been the standard for README files for over a decade now.
Winning a popularity contest doesn’t mean it’s good. That is the worst part about these things, as they just generate lowest-common-denominator code/tooling while also repeating anti-patterns/mistakes like the bash vs. sh-session/console issue I pointed out. Garbage in has been so much garbage out, unfortunately.
Never said it was good, just making an observation that Markdown is most likely to be available to render OOTB in more editors. I don't think Markdown is bad necessarily either. It's "good enough" for a simple document.
Documentation isn’t a simple document. There are tons of rich elements missing from the spec specifically for documentation… which is why so many resort to adopting one of the many incompatible forks of Markdown to try to get features that were missing (but are a part of reStructuredText & AsciiDoc for instance). That has a real tradeoff, since these forks are not going to be compatible & they aren’t going to be as well defined as the specs for these richer lightweight markup syntax choices.
https://i.postimg.cc/zDMD9nYD/Simon.png
https://github.com/microsoft/playwright-cli
This is different from the CLI used for running tests etc. that comes bundled with Playwright.
Sample use:
Main difference is Rodney can be installed as a single Go binary or via uv/pip, agent-browser is Rust and npm.
Looks like agent-browser was first released at the start of January, it's very new.
- E2E testing of browser components
- Taking screenshots before and after and having Claude look at them to double check things
- Driving it with an API and CLI as a headless browser
Will definitely give Rodney a look.
Please respect the Hacker News community and read https://news.ycombinator.com/item?id=46747998.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Thanks for your comment!
I didn't know about sh-session, is that documented anywhere?