> resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly
It's always surprising to me how absurdly efficient "highly specialized VM/instruction interpreters" are.
Like, e.g., two independent research projects into better (faster, more compact) serialization in Rust both ended up with something like a VM/interpreter for serialization instructions, leading to both higher performance and more compact code size while still being capable of supporting feature sets similar to serde's.(1)
(In general, monomorphisation and double dispatch (e.g. serde) can bring you very far, but as always the best approach is not either extreme: neither always monomorphisation nor always dynamic dispatch, but a balance that takes advantage of the strengths of both. And specialized mini VMs are, in a certain way, an extra-flexible form of dynamic dispatch.)
---
(1): More compact code size on normal-to-large projects, though not necessarily on micro projects, as the "fixed overhead" is often slightly larger while the per-serialization-type/protocol overhead can be smaller.
(1b): These were experimental research projects; I'm not sure whether any of them were published to GitHub, and none are suited for production use or similar.
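Out of curiosity, here's a minimal sketch of what such a "serialization instruction" interpreter could look like; the opcodes and layout are invented for illustration and not taken from either project:

```rust
// A hypothetical serialization "program": a flat list of instructions
// that one generic interpreter walks, instead of monomorphized
// per-type serialization code.
#[derive(Clone, Copy)]
enum SerOp {
    // Copy `len` raw bytes starting at `offset` in the source value.
    Raw { offset: usize, len: usize },
    // Write a little-endian u32 length prefix, then the bytes.
    LenPrefixed { offset: usize, len: usize },
}

fn serialize(program: &[SerOp], src: &[u8], out: &mut Vec<u8>) {
    for op in program {
        match *op {
            SerOp::Raw { offset, len } => {
                out.extend_from_slice(&src[offset..offset + len]);
            }
            SerOp::LenPrefixed { offset, len } => {
                out.extend_from_slice(&(len as u32).to_le_bytes());
                out.extend_from_slice(&src[offset..offset + len]);
            }
        }
    }
}

fn main() {
    // One small "program" per type, one shared interpreter for all types.
    let src = *b"hi!!";
    let program = [
        SerOp::LenPrefixed { offset: 0, len: 2 }, // field 1: "hi"
        SerOp::Raw { offset: 2, len: 2 },         // field 2: "!!"
    ];
    let mut out = Vec::new();
    serialize(&program, &src, &mut out);
    assert_eq!(out, vec![2, 0, 0, 0, b'h', b'i', b'!', b'!']);
}
```

The code-size win comes from the programs being data: each additional type costs a few table entries rather than another monomorphized function.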
A new Go protobuf parser [1] made the rounds here eight months ago [2] with a specialized VM that outperforms the default generated protobuf code by 3x.
It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code
You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
This seems counterintuitive, so I googled it. Apparently, it boils down to instruction cache efficiency and branch prediction, largely. The best content I could find was this post, as well as some scattered comments from Mike Pall of LuaJIT fame:
Interestingly, this is also discussed on a similar blogpost about using Clang's recent-ish [[musttail]] tailcall attribute to improve C++ JSON parsing performance:
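To make the branch-prediction point concrete, here's a toy sketch (mine, not from the linked posts) of per-opcode handlers dispatched through a function-pointer table. With guaranteed tail calls ([[musttail]], or Rust's nightly `become`), each handler would jump directly to the next, giving every opcode its own indirect branch instead of one shared, poorly predicted dispatch branch; a plain loop stands in for the tail call here so it runs on stable Rust:

```rust
// Each opcode gets its own handler function; dispatch goes through a
// table of function pointers indexed by the current bytecode byte.
type Handler = fn(&mut State) -> bool; // false = halt

struct State {
    pc: usize,
    acc: i64,
}

fn op_inc(s: &mut State) -> bool { s.acc += 1; s.pc += 1; true }
fn op_dec(s: &mut State) -> bool { s.acc -= 1; s.pc += 1; true }
fn op_halt(_s: &mut State) -> bool { false }

const TABLE: [Handler; 3] = [op_inc, op_dec, op_halt];

fn run(code: &[u8]) -> i64 {
    let mut s = State { pc: 0, acc: 0 };
    // Dispatch loop: look up the handler for the current byte and call it.
    // In a tail-call interpreter, each handler would end with a jump to
    // the next handler instead of returning here.
    while TABLE[code[s.pc] as usize](&mut s) {}
    s.acc
}

fn main() {
    // inc, inc, dec, halt  ->  accumulator ends at 1
    assert_eq!(run(&[0, 0, 1, 2]), 1);
}
```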
> It doesn't make sense to me that an embedded VM/interpreter could ever outperform direct code. You're adding a layer of abstraction and indirection, so how is it possible that a more indirect solution can have better performance?
It is funny, but (like I’ve already mentioned[1] a few months ago) for serialization(-adjacent) formats in particular the preferential position of bytecode interpreters has been rediscovered again and again.
The earliest example I know about is Microsoft’s MIDL, which started off generating C code for NDR un/marshalling but very soon (ca. 1995) switched to bytecode programs (which Microsoft for some reason called “format strings”; these days there’s also typelib marshalling and WinRT metadata-driven marshalling, the latter completely undocumented, but both data-driven). Bellard’s nonfree ffasn1 also (seemingly) uses bytecode, unlike the main FOSS implementations of ASN.1. Protocol Buffers started off with codegen (burying Google users in de/serialization code), but UPB uses “table-driven”, i.e. bytecode-based, parsing[2].
The most interesting chapter in this long history is in my opinion Swift’s bytecode-based value witnesses[3,4]. Swift (uniquely) has support for ABI compatibility with polymorphic value types, so e.g. you can have a field in the middle of your struct whose size and alignment only become known at dynamic linking time. It does this in pretty much the way you expect[5] (and the same way IBM’s SOM did inheritance across ABI boundaries decades ago): each type has a vtable (“value witness”) full of compiler-generated methods like size, alignment, copy, move, etc., which for polymorphic type instances will call the type arguments’ witness methods and compute on the results. Anyways, here too the story is that they started with native codegen, got buried under the generated code, and switched to bytecode instead. (I wonder—are they going to PGO and JIT next, like hyperpb[6] for Protobuf? Also, bytecode-based serde when?)
The article explains most of this, but the key takeaway for beginners once this lands: with `become` you can write tail calls in Rust, and the language promises they either work or don't compile. You can't have the case (which exists in several languages) where you thought you'd written a tail call but you hadn't (or maybe you had, but then you switched to a different compiler or made a seemingly inconsequential change to the code), and now the stack has overflowed.
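For the curious, a sketch of what that guarantee looks like in practice. The `become` form is nightly-only (behind what I believe is the `explicit_tail_calls` feature gate), so it's shown in a comment; the loop below is the shape the compiled code must take either way:

```rust
// On nightly, `become` makes the tail call explicit and checked:
//
//     #![feature(explicit_tail_calls)]
//     fn countdown(n: u64) -> u64 {
//         if n == 0 { return 0; }
//         become countdown(n - 1) // compile error if not a true tail call
//     }
//
// With a plain `return countdown(n - 1)`, release builds usually
// eliminate the call anyway, but debug builds may not, and a deep
// enough input then overflows the stack. The equivalent loop:
fn countdown(mut n: u64) -> u64 {
    while n != 0 {
        n -= 1; // the "tail call" is just a jump back to the loop head
    }
    n
}

fn main() {
    // Deep enough that a non-eliminated recursive version would
    // overflow a default-sized stack.
    assert_eq!(countdown(10_000_000), 0);
}
```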
Rust has been really good at providing ergonomic support for features we're too used to seeing provided as "Experts only" features with correspondingly poor UX.
A more accurate title would say it is a tail-call-optimized interpreter. Tail calls alone aren't special; what matters is that the compiler or runtime properly reuses the caller's frame instead of pushing another call frame and growing the stack.
Maybe; it probably depends on how you look at it. The optimization is obvious, and I expect any optimizing compiler will TCO all naive tail calls. The trouble in Rust, C++, and a dozen other languages is that you can so easily write code which you think can be optimized, but the compiler either can't see how, or can see that it's not possible, and (without this keyword) you don't find out, because growing the stack is a valid implementation of what you wrote even though it's not what you meant.
The "become" keyword allows us to express our meaning, we want the tail call, and, duh, of course the compiler will optimize that if it can be a tail call but also now the compiler is authorized to say "Sorry Dave, that's not possible" rather than grow the stack. Most often you wrote something silly. "Oh, the debug logging happens after the call, that's never going to work, I will shuffle things around".
I wouldn't call it optimized, since that implies it gains performance from the tail calls but would work without them; here the tail calls are integral to the function of the interpreter. It simply wouldn't work if the compiler couldn't be forced to emit them.
> Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called *tail-call elimination* or *tail-call optimization*. (https://en.wikipedia.org/wiki/Tail_call)
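Written out, the frame reuse that quote describes is just this transformation (gcd chosen as an arbitrary example):

```rust
// The recursive call is in tail position, so its frame can simply
// replace the caller's, i.e. the call compiles to a jump.
fn gcd_rec(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd_rec(b, a % b) } // tail call
}

// What tail-call elimination produces, written by hand:
fn gcd_loop(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b; // compute the "arguments" of the next call
        a = b;         // overwrite the current frame's parameters...
        b = t;
        // ...and jump back to the top instead of pushing a new frame.
    }
    a
}

fn main() {
    assert_eq!(gcd_rec(48, 18), 6);
    assert_eq!(gcd_loop(48, 18), 6);
}
```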
[1]: https://mcyoung.xyz/2025/07/16/hyperpb/
[2]: https://news.ycombinator.com/item?id=44591605
https://sillycross.github.io/2022/11/22/2022-11-22/
https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
[1] https://news.ycombinator.com/item?id=44665671, I’m too lazy to copy over the links so refer there for the missing references.
[2] https://news.ycombinator.com/item?id=44664592 and parent’s second link.
[3] https://forums.swift.org/t/sr-14273-byte-code-based-value-wi...
[4] Rexin, “Compact value witnesses in Swift”, 2023 LLVM Dev. Mtg., https://www.youtube.com/watch?v=hjgDwdGJIhI
[5] Pestov, McCall, “Implementing Swift generics”, 2017 LLVM Dev. Mtg., https://www.youtube.com/watch?v=ctS8FzqcRug
[6] https://mcyoung.xyz/2025/07/16/hyperpb/
Tail recursion opens the door for people to write really, really neat looping facilities using macros.
https://doc.rust-lang.org/std/keyword.become.html
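As a toy illustration (the macro name and shape are invented), here's a macro-based "looping facility" that mimics tail-recursive state passing on stable Rust; with `become`, the expansion could use real tail calls instead of a loop:

```rust
// A tiny looping construct: repeatedly apply a step function to a
// state until the step signals completion. Ok(next) plays the role of
// a tail call with new arguments; Err(done) is the final return value.
macro_rules! tail_loop {
    ($state:expr, $step:expr) => {{
        let mut state = $state;
        loop {
            match $step(state) {
                Ok(next) => state = next, // "tail call" with the new state
                Err(done) => break done,  // final result
            }
        }
    }};
}

fn main() {
    // Sum 1..=5 by looping over (i, acc) state.
    let sum: u32 = tail_loop!((1u32, 0u32), |(i, acc): (u32, u32)| {
        if i > 5 { Err(acc) } else { Ok((i + 1, acc + i)) }
    });
    assert_eq!(sum, 15);
}
```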
> Last week, I wrote a tail-call interpreter using the become keyword, which was recently added to nightly Rust (seven months ago is recent, right?).
The "become" keyword allows us to express our meaning, we want the tail call, and, duh, of course the compiler will optimize that if it can be a tail call but also now the compiler is authorized to say "Sorry Dave, that's not possible" rather than grow the stack. Most often you wrote something silly. "Oh, the debug logging happens after the call, that's never going to work, I will shuffle things around".