I'm biased by my preferred style of programming languages, but I think that pure, statically typed functional languages are incredibly well suited for LLMs. The purity gives you referential transparency and static analysis powers that the LLM can leverage to stay correct and on task.
The high level declarative nature and type driven development style of languages like Haskell also make it really easy for an experienced developer to review and validate the output of the LLM.
Early on in the GPT era I had really bad experiences generating Haskell code with LLMs but I think that the combination of improved models, increased context size, and agentic tooling has allowed LLMs to really take advantage of functional programming.
You are right that there is significantly more JavaScript in the training data, but I can say from experience that I'm a little shocked at how well Opus 4.5 has been for me writing Haskell. I'm fairly particular and I end up rewriting a lot of code for style reasons, but it can often one-shot an acceptable solution that is mostly in line with the rest of the code base.
I've also had decent experiences with Rust recently. I haven't done enough Haskell programming in the AI era to really say.
But it could be that different programming languages are a bit like different human languages for these models: when they have more than some threshold of training data, they can express their general problem solving skills in any of them? And then it's down to how much the compiler and linters can yell at them.
For Rust, I regularly tell them to make `clippy::pedantic` happy (and tell me explicitly when they think that the best way to do that is via an explicit ignore annotation in the code to disable a certain warning for a specific line).
Pedantic clippy is usually too... pedantic for humans, but it seems to work reasonably well with the agents. You can also add `clippy::cargo`, which ain't included in `clippy::pedantic`.
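For concreteness, a minimal sketch of that setup (the specific lint and the `to_port` function are made up for illustration, not taken from the comment above):

```rust
// Enable the pedantic and cargo lint groups crate-wide, then silence one
// specific lint on one item with an explicit, documented allow.
#![warn(clippy::pedantic)]
#![warn(clippy::cargo)] // not included in clippy::pedantic

// Deliberate narrowing: the caller guarantees the value fits in a u16,
// so the truncation lint is allowed here rather than disabled globally.
#[allow(clippy::cast_possible_truncation)]
fn to_port(raw: u64) -> u16 {
    raw as u16
}

fn main() {
    println!("{}", to_port(8080));
}
```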
> But it could be that different programming languages are a bit like different human languages for these models: when they have more than some threshold of training data, they can express their general problem solving skills in any of them? And then it's down to how much the compiler and linters can yell at them.
This is kind of just a measurement of how representative a language is in the distribution of the tokenizer training. You could have a single token equal to “public static void main”.
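A toy illustration of that point (greedy longest-match over a made-up vocabulary, not a real tokenizer): a phrase that exists verbatim in the vocabulary costs one token, while less common syntax splits into several.

```rust
// Toy greedy longest-match "tokenizer" over a hypothetical vocabulary.
fn tokenize(vocab: &[&str], mut input: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        // Longest vocabulary entry that prefixes the remaining input,
        // falling back to a single character.
        let piece = vocab
            .iter()
            .filter(|v| input.starts_with(**v))
            .max_by_key(|v| v.len())
            .map(|v| v.to_string())
            .unwrap_or_else(|| input.chars().next().unwrap().to_string());
        input = &input[piece.len()..];
        tokens.push(piece);
    }
    tokens
}

fn main() {
    let vocab = ["public static void main", "pub", " fn", "main"];
    println!("{:?}", tokenize(&vocab, "public static void main")); // 1 token
    println!("{:?}", tokenize(&vocab, "pub fn main"));             // 4 tokens
}
```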
Realistically, it’s also a function of how many iterations it takes for an AI agent to correctly solve a problem with a given language. I’d imagine most AI agents would frequently have to redo J [1] or F# code, as they are fairly uncommon languages with much smaller training sets than JavaScript or Python.
I can say that for F# this has been mostly true up until quite recently. We use F# at work and were mostly unable to use agents like Claude Code up until the release of Opus 4.5, which seems to know F# quite well.
This is interesting research; thank you for doing it.
I am not sure token efficiency is an interesting problem in the long term, though.
And in the short term I wonder if prompts could be pre-compiled to “compressed tokens”; the idea would be to use a smaller number of tokens to represent a frequently needed concept; kind of like LZ compression. Or maybe token compression becomes a feature of future models optimized for specific tasks.
I was wondering last year if it would be worthwhile trying to create a language that was especially LLM-friendly, e.g. one that embedded more context in the language structure. The idea is to make more of the program, and the thinking behind it, explicit to the LLM, but in a programming-language style to eliminate the ambiguity of natural language (otherwise one could just use comments).
Then it occurred to me that with current LLM training methodology there’s a chicken-and-egg problem; it doesn’t start to show rewards until there is a critical mass of good code in the language for LLMs to train on.
It strikes me that more tokens likely give the LLM more time/space to "think". Also that more redundant tokens, like local type declarations instead of type inference from far away, likely often reduce the portion of the code LLMs (and humans) have to read.
So I'm not convinced this is either the right metric, or even if you got the right metric that it's a metric you want to minimize.
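As a small illustration of the local-annotation point above (the `load_scores` helper and its types are hypothetical, purely for the example):

```rust
use std::collections::HashMap;

// Hypothetical helper whose definition lives "far away" in another module.
fn load_scores() -> HashMap<String, Vec<f64>> {
    HashMap::new()
}

fn main() {
    // Inferred: a reader (or an LLM seeing only this hunk) has to chase the
    // definition of `load_scores` to know what `scores` is.
    let scores = load_scores();

    // Redundant but local: the extra tokens make the shape of the data
    // visible right here.
    let annotated: HashMap<String, Vec<f64>> = load_scores();

    println!("{} {}", scores.len(), annotated.len());
}
```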
With chain of thought (text thinking), the models can already use as much compute as they want in any language (determined by reinforcement learning training).
I'm not convinced that thinking tokens (which sort of have to serve a specific chain-of-thought purpose) are interchangeable with input tokens, which give the model compute without having it add new text.
For a very imperfect human analogy, it feels like saying "a student can spend as much time thinking about the text as they want, so the textbook can be extremely terse".
Definitely just gut feelings though - not well tested or anything. I could be wrong.
I would expect that we’ll end up compressing (or whatever term you would use) this at some point, so many of those syntactic differences will not be as significant.
But I would love for more expressive and compact languages to do better, selfish as I am. But I think training data size is more of a factor, and we won’t all be moving to Clojure any time soon.
I suspect DB queries will also benefit from token-efficient query languages as RAG queries grow exponentially. I've been working on one such language that is emitted in a token-efficient IR and compiles to SQL. https://memelang.net/
I knew that without reading it. But having each system call come in two versions that aren't even closely related to each other (monadic/dyadic) makes it hard for me to learn. I really appreciate this language for its shortness, but this kind of shortness can be annoying.
I guess it also depends on which dataset the LLM was trained on. Rare or niche languages get fragmented into more tokens even if the code itself is short. So two languages with the same number of characters can produce very different token counts, because one aligns with what the model has seen millions of times and the other does not.
I don't think context size is really the limit for larger codebases - it's more about how you use that context.
Claude Code makes some efforts to reduce context size, but at the end of the day it is loading entire source files into context (then keeping them there until told to remove them, or until context is compacted). One of the major wins is to run subagents for some tasks, which use their own context rather than loading more into CC's own context.
Cursor makes more efficient use of context by building a vector database of code chunks, then only loading matching chunks into context (I believe it does this for Composer/agentic use as well as for tab/autocomplete).
One of the more obvious ways to reduce context use in a larger multi-module codebase would be to take advantage of the split between small module definitions (e.g. C++ .h files) and large module implementations (.cpp files). Generally you'd only need to load module interfaces/definitions into context if you are working on code that uses the module, and Cursor's chunked approach can reduce that further.
For a whole-codebase overview, a language server can help locate things, and one could have the AI itself generate shortish summaries/overviews of source files and the codebase structure, similar to what a human developer might keep in their head, rather than repeatedly reading entire source files for code that isn't actually being modified.
It seems we're really in the early days of agentic coding tools, and they have a lot of room to get better and more efficient.
The approaches used by Claude Code and Cursor are inefficient. It's possible to calculate a covering set for a piece of code and provide that to an agent directly via a tool, and it turns out that this can reduce context usage in SWE-bench style tasks by >90% over RAG and grep/read.
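A toy sketch of the covering-set idea (the dependency map here is hypothetical input; a real tool would derive it from a parser or language server rather than hard-coding it):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Given a map from each file to the files it directly depends on, collect the
// transitive closure of dependencies for the file being edited. Only this set
// needs to go into the agent's context.
fn covering_set<'a>(deps: &HashMap<&'a str, Vec<&'a str>>, target: &'a str) -> HashSet<&'a str> {
    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([target]);
    while let Some(file) = queue.pop_front() {
        if seen.insert(file) {
            if let Some(direct) = deps.get(file) {
                queue.extend(direct.iter().copied());
            }
        }
    }
    seen
}

fn main() {
    let deps = HashMap::from([
        ("handlers.rs", vec!["models.rs", "db.rs"]),
        ("db.rs", vec!["config.rs"]),
        ("models.rs", vec![]),
        ("config.rs", vec![]),
        ("unrelated.rs", vec!["models.rs"]),
    ]);
    // Only files reachable from handlers.rs go into context; unrelated.rs is skipped.
    println!("{:?}", covering_set(&deps, "handlers.rs"));
}
```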
As with most LLM-made READMEs and the six bajillion AI/agentic/LLM tools now on GitHub, I can barely get a grasp on what I'm looking at here, or how to use it practically.
> Smart code bundler that turns repositories into optimized code bundles meeting a token budget in milliseconds
Ok. So it's a tool; do I use it on my repo once? Then what? Do I use it as I go? Does it sit accessible to something like Claude Code, with the onus on me to direct Claude to use this to search files? I can see some CLI examples, but what should I do with them, and where does that fit into what people are using with Cursor / Claude / Gemini etc.?
This is the part I've been trying to hammer home about LLM-created stuff. It leaves us with vague, not-well-understood outcomes that might do something. I'm not against creating tools with LLMs, but I'm pretty against people creating the basic README with LLMs. We need humans in here telling other humans how to use it, because LLMs flat out lose the plot over the course of a large project. I think a big issue is that LLMs can sometimes be more eloquent at writing than a lot of people, so they opt for the LLM-generated README.
But as someone who would maybe consider using something like this, I see that README and it just looks like every Claude Code thing I've put together to date. Which is to say: I've done some seemingly impossible things with Claude, only to find that its attempt to recap the entirety of it ended up as a whole lot of seemingly meaningful words, phrases, and sentences that actually paint a super disjointed picture of what exactly the repo is about.
I'm finding that I have to share more and more code to ensure that various standards are being kept.
For example, I shared some Model code with Claude and Gemini (both via web interfaces) and they both tried to put Controller code into the Model, despite me telling them multiple times that the code wasn't wanted or needed in there.
I had to (eventually) share the entire project with the models (despite them having been working with the code all along) before they would comply with my request (whilst also congratulating me on my far superior architecture...).
That costs more tokens for each problem than just saying "here, look at this section and work toward this goal".
I think this is exactly right.
`public` might have a token by itself, even though you can have `pub` occurring in other contexts, too.
Seeing all the C languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.
But I had never considered that a programming language might be created that's less human-readable/auditable to enable LLMs.
Scares me a bit.
We're not building a language for LLMs just yet.
Because that’s what happened in the real world when generating a bunch of untyped Python code.
[1] https://www.jsoftware.com/
E.g. when it comes to authoring code, C is by far one of the languages that LLMs excel most at.
If you're interested in learning more, https://github.com/sibyllinesoft/scribe