Yesterday I wrote a post about exactly this. Software development, as the act of manually producing code, is dying. A new discipline is being born. It is much closer to proper engineering.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
Today I gave a lecture to my undergraduate data structures students about the evolution of CPU and GPU architectures since the late 1970s. The main themes:
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
I use AI in my workflow mostly for simple boilerplate, or to troubleshoot issues/docs.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for complaining).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some if these patterns as the book develops to see if that makes a difference. Understanding how other peopke really use these tools is a big gap for me.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand them, and the patterns really well. The use cases I've tried so far, do sort of work, just not yet at least, faster than I'm able to actual write the code myself.
I’m not OP but every time I post a comment with this sentiment I get told “the latest models are what you need”. If every 3 months you are saying “it’s ready as long as you use the latest model”, then it wasn’t ready 3 months ago and it’s not likely to be ready now.
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
My experience is that the first iteration output from a single agent is not what I want to be on the hook for. What squares it for me with "not writing code anymore" is the iterative process to improve outputs:
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me.
2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
I was in the same boat as you until I saw DHH post about how he’s changed his use of agents. In his talk with Lex Fridman his approach was similar to mine and it really felt like a kernel of sanity amongst the hype. So when he said he’s changed his approach I had another look. I’m using agents (Claude code) every day now. I still write code every day too. (So does Dax Raad from OpenCode to throw a bit more weight behind this stance). I’m not convinced the models can own a production code base and that therefore engineers need to maintain their skills sufficiently to be responsible. I find agents helpful for a lot of stuff, usually heavily patterned code with a lot of prior art. I find CC consistently sucks at writing polars code. I honestly don’t enjoy using agents at all and I don’t think anyone can honestly claim they know how this is going to shake out. But I feel by using the tools myself I have a much stronger sense of reality amongst the hype.
Linear walkthrough: I ask my agents to give me a numbered tree. Controlling tree size specifies granularity. Numbering means it's simple to refer to points for discussion.
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
I've experimented with agentic coding/engineering a lot recently. My observation is that software that is easily tested are perfect for this sort of agentic loop.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
You have to have really good tests as it fucks up in strange ways people don't (because I think experienced programmers run loops in their brain as they code)
Good news - agents are good at open ended adding new tests and finding bugs. Do that. Also do unit tests and playwright. Testing everything via web driving seems insane pre agents but now its more than doable.
For web apps, explictly asking the agent to build in sensible checkpoints and validate at the checkpoint using Playwright has been very successful for me so far. It prevents the agent from strating off course and struggling to find its way back. That and always using plan mode first, and reviewing the plan for evidence of sensible checkpoints. /opusplan to save tokens!
I primarily use AI for understanding codebases myself. My prompt is:
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persistent into codebase.md"
Both the concise and structure nature (code snippets) help me gain knowledge of the entire codebase - as I progressively ask complex questions on the codebase.
The underlying technology is still improving at a rapid pace. Many of last year's tricks are a waste of tokens now. Some ideas seem less fragile: knowing two things allows you to imagine the confluence of the two so you know to ask. Other things are less so: I'm a big fan of the test-based iteration loop; it is so effective that I suspect almost all users have arrived at it independently[0]. But the emergent properties of models are so hard to actually imagine. A future sufficiently-smart intelligence may take a different approach that is less search and more proof. I wouldn't bet on it, but I've been surprised too many times over the last few years.
Is there anything about reviewing the generated code? Not by the author but by another human being.
Colleagues don’t usually like to review AI generated code. If they use AI to review code, then that misses the point of doing the review. If they do the review manually (the old way) it becomes a bottleneck (we are faster at producing code now than we are at reviewing it)
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
https://nonstructured.com/zen-of-ai-coding/
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for complaining).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some if these patterns as the book develops to see if that makes a difference. Understanding how other peopke really use these tools is a big gap for me.
I think trying agents to do larger tasks was always very hit or miss, up to about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand them, and the patterns really well. The use cases I've tried so far, do sort of work, just not yet at least, faster than I'm able to actual write the code myself.
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me. 2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
- Using different models for code review
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
[1] https://github.com/mohsen1/fesh
Good news - agents are good at open ended adding new tests and finding bugs. Do that. Also do unit tests and playwright. Testing everything via web driving seems insane pre agents but now its more than doable.
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood prepare for follow up questions from me in a rapid fire pattern, your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response ask me if this particular piece of knowledge should be persistent into codebase.md"
Both the concise and structure nature (code snippets) help me gain knowledge of the entire codebase - as I progressively ask complex questions on the codebase.
- tell the agent to write a plan, review the plan, tell the agent to implement the plan
- allow the agent to “self discover” the test harness (eg. “Validate this c compiler against gcc”)
- queue a bunch of tasks with // todo … and yolo “fix all the todo tasks”
- validate against a known output (“translate this to rust and ensure it emits the same byte or byte output as you go”)
- pick a suitable language for the task (“go is best for this task because I tried several languages and it did the best for this domain in go”)
Feels like it’s a lot of words to say what amounts to make the agent do the steps we know works well for building software.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-01/Grounding_Yo...
Colleagues don’t usually like to review AI generated code. If they use AI to review code, then that misses the point of doing the review. If they do the review manually (the old way) it becomes a bottleneck (we are faster at producing code now than we are at reviewing it)