Claude is good at assembling blocks, but still falls apart at creating them

(approachwithalacrity.com)

78 points | by bblcla 1 day ago

8 comments

joshcsimmons 2 minutes ago
IDK I've been using opus 4.5 to create a UI library and it's been doing pretty well: https://simsies.xyz/ (still early days)
Granted, it was building on top of Tailwind (shifting over to Radix after the layoff news). Begs the question: what is a lego?
maxilevi 1 hour ago
LLMs are just really good search. Ask it to create something and it's searching within the pretrained weights. Ask it to find something and it's semantically searching within your codebase. Ask it to modify something and it will do both. Once you understand it's just search, you can get really good results.
[-]
- fennecbutt 16 minutes ago
  I agree somewhat, but more when it comes to its use of logic - it only gleans logic from human language which as we know is a fucking mess.
  I've commented before on my belief that the majority of human activity is derivative. If you ask someone to think of a new kind of animal, alien or random object they will always base it off things that they have seen before. Truly original thoughts and things in this world are an absolute rarity and the majority of supposed original thought riffs on what we see others make, and those people look to nature and the natural world for inspiration.
  We're very good at taking thing a and thing b and slapping them together and announcing we've made something new. Someone please reply with a wholly original concept. I had the same issue recently when trying to build a magic based physics system for a game I was thinking of prototyping.
- bhadass 1 hour ago
  better mental model: it's a lossy compression of human knowledge that can decompress and recombine in novel (sometimes useful, sometimes sloppy) ways.
  classical search simply retrieves, llms can synthesize as well.
  [-]
  - andy99 20 minutes ago
    No, this describes the common understanding of LLMs and adds little to just calling it AI. The search is the more accurate model when considering their actual capabilities and understanding weaknesses. “Lossy compression of human knowledge” is marketing.
    [-]
    - XenophileJKO 7 minutes ago
      It is fundamentally and provably different than search because it captures things on two dimensions that can be used combinatorially to infer desired behavior for unobserved examples.
      1. Conceptual Distillation - Proven by research work that we can find weights that capture/influence outputs that align with higher level concepts.
      2. Conceptual Relations - The internal relationships capture how these concepts are related to each other.
      This is how the model can perform acts and infer information way outside of its training data. Because if the details map to concepts then the conceptual relations can be used to infer desirable output.
  - DebtDeflation 21 minutes ago
    Information Retrieval followed by Summarization is how I view it.
  - RhythmFox 1 hour ago
    This isn't strictly better to me. It captures some intuitions about how a neural network ends up encoding its inputs over time in a 'lossy' way (doesn't store previous input states in an explicit form). Maybe saying 'probabilistic compression/decompression' makes it a bit more accurate? I do not really think it connects to your 'synthesize' claim at the very end to call it compression/decompression, but I am curious if you had a specific reason to use the term.
    [-]
    - XenophileJKO 19 minutes ago
      It's really way more interesting than that.
      The act of compression builds up behaviors/concepts of greater and greater abstraction. Another way you could think about it is that the model learns to extract commonality, hence the compression. What this means is because it is learning higher level abstractions AND the relationships between these higher level abstractions, it can ABSOLUTELY learn to infer or apply things way outside its training distribution.
  - andrei_says_ 1 hour ago
    “Novel” to the person who has not consumed the training data. Otherwise, just training data combined in highly probable ways.
    Not quite autocomplete but not intelligence either.
    [-]
    - pc86 36 minutes ago
      What is the difference between "novel" and "novel to someone who hasn't consumed the entire corpus of training data, which is several orders of magnitude greater than any human being could consume?"
      [-]
      - szundi 21 minutes ago
        [dead]
    - soulofmischief 1 hour ago
      Citation needed that grokked capabilities in a sufficiently advanced model cannot combinatorially lead to contextually novel output distributions, especially with a skilled guiding hand.
      [-]
      - arcanemachiner 52 minutes ago
        Pretty sure burden of proof is on you, here.
        [-]
        soulofmischief 44 minutes ago
        It's not, because I haven't ruled out the possibility. I could share anecdata about how my discussions with LLMs have led to novel insights, but it's not necessary. I'm keeping my mind open, but you're asserting an unproven claim that is currently not community consensus. Therefore, the burden of proof is on you.
- cultureulterior 9 minutes ago
  This is not true.
- johnisgood 1 hour ago
  Calling it "just search" is like calling a compiler "just string manipulation". Not false, but aggressively missing the point.
  [-]
  - maxilevi 1 hour ago
    I don't mean search in the reductionist way but rather that it's much better at translating, finding and mapping concepts if everything is provided vs creating from scratch. If it could truly think it would be able to bootstrap creations from basic principles like we do, but it really can't. Doesn't mean it's not a great powerful tool.
    [-]
    - ordinaryatom 36 minutes ago
      > If it could truly think it would be able to bootstrap creations from basic principles like we do, but it really can't.
      alphazero?
  - oliverbennett 1 hour ago
    It feels like defining LLMs by what they're good at. Which also includes things like summarisation and grouping things.
  - emp17344 1 hour ago
    No, “just search” is correct. Boosters desperately want it to be something more, but it really is just a tool.
    [-]
    - johnisgood 1 hour ago
      Yes, it is a tool. No, it is not "just search".
      Is your CPU running arbitrary code "just search over transistor states"?
      Calling LLMs "just search" is the kind of reductive take that sounds clever while explaining nothing. By that logic, your brain is "just electrochemical gradients".
      [-]
      - jvanderbot 1 hour ago
        What would you add?
        To me it's "search" like a missile does "flight". It's got a target and a closed loop guidance, and is mostly fire and forget (for search). At that, it excels.
        I think the closed loop+great summary is the key to all the magic.
        [-]
        bitwize 1 hour ago
        Which is kind of funny because my standard quip is that AI research, beginning in the 1950s/1960s, and indeed much of late 20th century computer tech especially along the Boston/SV axis, was funded by the government so that "the missile could know where it is". The DoD wanted smarter ICBMs that could autonomously identify and steer toward enemy targets, and smarter defense networks that could discern a genuine missile strike from, say, 99 red balloons going by.
        soulofmischief 1 hour ago
        It's a prediction algorithm that walks a high-dimensional manifold; in that sense all application of knowledge is just "search". So yes, you're fundamentally correct but still fundamentally wrong, since you think this foundational truth is the end and beginning of what LLMs do, and thus your mental model does not adequately describe what these tools are capable of.
        [-]
        jvanderbot 57 minutes ago
        Me? My mental model? I gave an analogy for Claude, not an explanation for LLMs.
        But you know what? I was mentally thinking of both deep think / research and Claude code, both of which are literally closed loop. I see this is slightly off topic b/c others are talking about the LLM only.
        [-]
        soulofmischief 45 minutes ago
        Sorry, I should have said "analogy" and not "mental model", that was presumptuous. Maybe I also should have replied to the GP comment instead.
        Anyway, since we're here, I personally think giving LLMs agency helps unlock this latent knowledge, as it provides the agent more mobility when walking the manifold. It has a better chance at avoiding or leaving local minima/maxima, among other things. So I don't know if agentic loops are entirely off-topic when discussing the latent power of LLMs.
      - RhythmFox 1 hour ago
        I mean, actually not a bad metaphor, but it does depend on the software you are running as to how much of a 'search' you could say the CPU is doing among its transistor states. If you are running an LLM then the metaphor seems very apt indeed.
Scrapemist 19 minutes ago
Eventually you can show Claude how you solve problems, and explain the thought process behind it. It can apply these learnings but it will encounter new challenges in doing so. It would be nice if Claude could instigate a conversation to go over the issues in depth. Now it wants quick confirmation to plough ahead.
[-]
- fennecbutt 13 minutes ago
  Well I feel like this is because a better system would distill such learning into tokens not associated with a human language and that that could represent logic better than using English etc for it.
  I don't have the GPUs or time to experiment though :(
simonw 1 hour ago
I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code:
> But in context, this was obviously insane. I knew that key and id came from the same upstream source. So the correct solution was to have the upstream source also pass id to the code that had key, to let it do a fast lookup.
I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.
My guess is that Claude is trained to bias towards making minimal edits to solve problems. This is a desirable property, because six months ago a common complaint about LLMs was that you'd ask for a small change and they would rewrite dozens of additional lines of code.
I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.
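For concreteness, the lookup trade-off in the quoted anecdote can be sketched roughly as below. This is a Python stand-in, not the author's React code: `Item`, `find_by_key`, `build_index` and `find_by_id` are hypothetical names, and it assumes the slow version was a linear scan on `key` while the upstream source could also hand over `id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    id: int
    key: str


def find_by_key(items: list, key: str) -> Optional[Item]:
    """Minimal-edit version: the caller only has `key`, so scan every item (O(n))."""
    for item in items:
        if item.key == key:
            return item
    return None


def build_index(items: list) -> dict:
    """Built once upstream, where both `key` and `id` are known."""
    return {item.id: item for item in items}


def find_by_id(index: dict, item_id: int) -> Optional[Item]:
    """Larger change: the caller is also given `id`, so the lookup is a dict hit."""
    return index.get(item_id)


# Hypothetical data: 1000 items whose key and id come from the same source.
items = [Item(id=i, key=f"k{i}") for i in range(1000)]
index = build_index(items)
```

Both functions return the same item; the second one just requires touching the calling code, which is the part a "minimal edits" bias would tend to avoid.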
[-]
- Kuinox 52 minutes ago
  > My guess is that Claude is trained to bias towards making minimal edits to solve problems.
  I don't have the same feeling. I find that claude tends to produce wayyyyy too much code to solve a problem, compared to other LLMs.
- bblcla 53 minutes ago
  (Author here)
  > I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code
  Yeah, that's fair - a friend of mine also called this out on Twitter (https://x.com/konstiwohlwend/status/2010799158261936281) and I went into more technical detail about the specific problem there.
  > I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.
  I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.
  > I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.
  I'm not sure about this! There are a few things Claude does that seem unfixable even by updating CLAUDE.md.
  Some other footguns I keep seeing in Python and constantly have to fix despite CLAUDE.md instructions are:
  - writing lots of nested if clauses instead of writing simple functions by returning early
  - putting imports in functions instead of at the top-level
  - swallowing exceptions instead of raising (constantly a huge problem)
  These are small, but I think it's informative of what the models can do that even Opus 4.5 still fails at these simple tasks.
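A small illustration of the three patterns in the list above, with a "before" and "after" version side by side. The port-parsing scenario and both function names are made up for the example; this is only a sketch of the style difference, not code from the article.

```python
import json  # top-level import, visible to the whole module


def parse_port_nested(text, default=8080):
    """The style being complained about: nested ifs, local import, swallowed errors."""
    try:
        import json  # import buried inside the function body
        if text:
            data = json.loads(text)
            if isinstance(data, dict):
                if "port" in data:
                    return data["port"]
    except Exception:
        pass  # malformed input silently falls through to the default
    return default


def parse_port(text, default=8080):
    """Same job with early returns; malformed JSON is raised to the caller."""
    if not text:
        return default
    data = json.loads(text)  # raises json.JSONDecodeError on bad input
    if not isinstance(data, dict) or "port" not in data:
        return default
    return data["port"]
```

With valid input the two behave the same. The difference shows up on bad input such as `"{oops"`: the nested version quietly returns the default, while the flat version raises so the caller finds out.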
  [-]
  - chapel 9 minutes ago
    Those Python issues are things I had to deal with earlier last year with Claude Sonnet 3.7, 4.0, and to a lesser extent Opus 4.0 when it was available in Claude Code.
    In the Python projects I've been using Opus 4.5 with, it hasn't been showing those issues as often, but then again the projects are throwaway and I cared more about the output than the code itself.
    The nice thing about these agentic tools is that if you set up feedback loops for them, they tend to fix issues that are brought up. So much of what you bring up can be caught by linting.
    The biggest unlock for me with these tools is not letting the context get bloated, not using compaction, and focusing on small chunks of work and clearing the context before working on something else.
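As a rough idea of what such a lint-style feedback loop could check, here is a toy checker built on Python's standard `ast` module. It is an assumption-laden sketch: `find_footguns` and `SAMPLE` are hypothetical names, it only covers two of the patterns mentioned in this thread (imports inside functions, and `except` blocks whose whole body is `pass`), and a real project would more likely rely on an existing linter.

```python
import ast


def find_footguns(source: str) -> list:
    """Return sorted (line, message) pairs for two patterns:
    imports inside functions and except blocks that only `pass`."""
    problems = set()  # a set, so nested functions do not report twice
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if isinstance(inner, (ast.Import, ast.ImportFrom)):
                    problems.add((inner.lineno, "import inside function"))
        elif isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                problems.add((node.lineno, "exception swallowed with pass"))
    return sorted(problems)


# Hypothetical snippet of the kind an agent might produce.
SAMPLE = '''\
def load(path):
    import json
    try:
        return json.load(open(path))
    except Exception:
        pass
'''

for line, message in find_footguns(SAMPLE):
    print(f"line {line}: {message}")
```

Feeding output like this back to the agent after each change is one way to turn a style preference into a check it has to pass.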
    [-]
    - bblcla 1 minute ago
      Arguably linting is a kind of abstraction block!
  - ako 17 minutes ago
    > I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.
    Claude already does this. Yesterday I asked it why some functionality was slow. It did some research, then came back with all the right performance numbers, how often certain code was called, and opportunities to cache results to speed up execution. It refactored the code, ran performance tests, and reported the performance improvements.
  - doug_durham 25 minutes ago
    That's where you come in as an experienced developer. You point out the issues and iterate. That's the normal flow of working with these tools.
    [-]
    - bblcla 22 minutes ago
      I agree! Like I said at the end of the post, I think Claude is a great tool. In this piece, I'm arguing against the 'AGI' believers who think it's going to replace all developers.
- joshribakoff 53 minutes ago
  I expect that adding instructions that attempt to undo training produces worse results than not including the overbroad generalization in the training in the first place. I think the author isn’t making a complaint; they’re documenting a tradeoff.
- AIorNot 53 minutes ago
  Well, yes, but the wider point is that it takes new human skills to manage them, like a pair of horses under your bridle, so to speak.
  When it comes down to it, these AI tools are like going from the artisanal era to power tools or machines, or from a surgical knife to a machine gun. They operate at a faster pace without comprehending the way humans do, and without allowing humans time to comprehend all the side effects and massive assumptions they make on every run in their context window.
  Humans have to adapt to managing them correctly and at the right scale to be effective, and that becomes something you learn.
michalsustr 1 hour ago
This article resonates exactly with how I think about it as well. For example, at minfx.ai (a Neptune/wandb alternative), we cache time series that can contain millions of floats for fast access. Any engineer worth their title would never make a copy of these and would pass around pointers for access. Opus, when stuck in a place where passing the pointer was a bit more difficult (due to async and Rust lifetimes), would just make the copy, rather than rearchitect or at least stop and notify the user. Many such examples of ‘lazy’ and thus bad design.
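The Rust specifics here (async, lifetimes) do not carry over to other languages, so the following is only a loose Python analogy of the copy-versus-reference trade-off, using the standard `array` and `memoryview` types. The names `series`, `window_copy` and `window_view` are hypothetical, and the million-element array is a stand-in for a cached time series.

```python
from array import array

# Stand-in for a cached time series of floats.
series = array("d", (float(i) for i in range(1_000_000)))

# Slicing the array itself allocates a new array and copies the elements.
window_copy = series[1000:2000]

# Slicing a memoryview gives a window onto the same buffer, with no copy.
window_view = memoryview(series)[1000:2000]

# A later write to the cache is visible through the view but not the copy.
series[1000] = -1.0
```

The copy is the easy way out and is always "correct" in isolation; the view is cheaper but ties the consumer to the lifetime of the underlying buffer, which is roughly the constraint that made the pointer-passing version harder in the Rust case.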
mikece 1 day ago
In my experience Claude is like a "good junior developer" -- can do some things really well, FUBARs other things, but on the whole something to which tasks can be delegated if things are well explained. If/when it gets to the ability level of a mid-level engineer it will be revolutionary. Typically a mid-level engineer can be relied upon to do the right thing with no/minimal oversight, can figure out incomplete instructions, and deliver quality results (and even train up the juniors on some things). At that point the only reason to have human junior engineers is so they can learn their way up the ladder to being an architect and responsible for coordinating swarms of Claude Agents to develop whole applications and complete complex tasks and initiatives.
Beyond that what can Claude do... analyze the business and market as a whole and decide on product features, industry inefficiencies, gap analysis, and then define projects to address those and coordinate fleets of agents to change or even radically pivot an entire business?
I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.
[-]
- alfalfasprout 2 hours ago
  > I don't think we'll get to the point where all you have is a CEO and a massive Claude account but it's not completely science fiction the more I think about it.
  At that point, why do you even need the CEO?
  [-]
  - arjie 2 hours ago
    Reminds me of an old joke[0]:
    > The factory of the future will have only two employees, a man and a dog. The man will be there to feed the dog. The dog will be there to keep the man from touching the equipment.
    But really, the reason is that people like Pieter Levels do exist: masters at product vision and marketing. He also happens to be a proficient programmer, but there are probably other versions of him which are not programmers who will find the bar to product easier to meet now.
    0: https://quoteinvestigator.com/2022/01/30/future-factory/
    [-]
    - MrDunham 1 hour ago
      My technical cofounder reminds me of this story on a weekly basis.
  - jerf 1 hour ago
    You will need the CEO to watch over the AI and ensure that the interests of the company are being pursued and not the interests of the owners of the AI.
    That's probably the biggest threat to the long-term success of the AI industry; the inevitable pull towards encroaching more and more of their own interests into the AI themselves, driven by that Harvard Business School mentality we're all so familiar with, trying to "capture" more and more of the value being generated and leaving less and less for their customers, until their customers' full-time job is ensuring the AIs are actually generating some value for them and not just the AI owner.
  - ako 2 hours ago
    And who does he sell his software to? Companies that have only 1 employee, don’t need a lot of user licenses for their employees…
    [-]
    - AshamedCaptain 1 hour ago
      What would be the point of selling software in such a world ? (where anyone could build any piece of software with a handful of keystrokes)
  - pixelready 1 hour ago
    The board (in theory) represents the interests of investors, and even with all of the other duties of a CEO stripped away, they will want a ringable neck / PR mouthpiece / fall guy for strategic missteps or publicly unpopular moves by the company. The managerial equivalent of having your hands on the driving wheel of a self-driving car.
  - mettamage 1 hour ago
    All of us are a CEO by that point.
    [-]
    - ArtificialAI 1 hour ago
      If everyone is, no one is.
      [-]
      - empath75 1 hour ago
        Wouldn't that be a good thing?
  - ceejayoz 2 hours ago
    As Steinbeck is often slightly misquoted:
    > Socialism never took root in America because the poor see themselves not as an exploited proletariat, but as temporarily embarrassed millionaires.
    Same deal here, but everyone imagines themselves as the billionaire CEO in charge of the perfectly compliant and effective AI.
  - tiku 1 hour ago
    For the network.
doug_durham 28 minutes ago
Did the author ask it to make new abstractions? In my experience, when it produces output that I don't like, I ask it to refactor it. These models have an understanding of all modern design patterns. Just ask it to adopt one.
[-]
- bblcla 23 minutes ago
  (Author here)
  I have! I agree it's very good at applying abstractions, if you know exactly what you want. What I notice is that Claude has almost no ability to surface those abstractions on its own.
  When I started having it write React, Claude produced incredibly buggy spaghetti code. I had to spend 3 weeks learning the fundamentals of React (how to use hooks, providers, stores, etc.) before I knew how to prompt it to write better code. Now that I've done that, it's great. But it's meaningful that someone who doesn't know how to write well-abstracted React code can't get Claude to produce it on their own.
  [-]
  - michalsustr 2 minutes ago
    Same experience here! As an analogy, consider that the model knows both Arabic and Roman number representations. But in an alternate universe, it has been trained so much on Roman numerals ("Bad Code") that it won't give you the Arabic ones ("Good Code") unless you prompt it directly, even when they are clearly superior.
    I also believe that overall repository code quality is important for AI agents - the more "beautiful" it is, the more the agent can mimic the "beauty".
mklyachman 1 hour ago
Wow, what an excellent blog. Highly suggest trying out the creator's tool (stardrift.ai) too!