Congrats to the TigerBeetle team on the new feature! It looks like TB has already moved from a shared-nothing to a partially shared-disk (object storage) architecture. We have always been big fans of TigerBeetle's engineering and are actually using TB's excellent io_uring runtime [1] to build a new object storage system, so this connection feels amazing to me.
Great video! Love the concepts discussed, and the tigerstyle mentioned by adityaathealye. This is such a hard problem.
For Halo 4, I architected a system for our telemetry that had to support subsecond roundtrip user lifetime aggregation. With up to 400ms one-way latencies, that left us with 200ms client + server time to manage lifetime counts. 60ms on the client for batching and network stack delays, and 80ms in the cloud for aggregation after routing delays.
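The budget above can be checked with a few lines of arithmetic. This is only a sketch: it assumes "subsecond" means a 1000 ms round-trip target, which the comment does not state exactly, and the variable names are made up for illustration.

```python
# Latency budget sketch. ROUND_TRIP_TARGET_MS is an assumption ("subsecond");
# the other figures are the ones quoted in the comment.
ROUND_TRIP_TARGET_MS = 1000
ONE_WAY_NETWORK_MS = 400   # worst-case one-way latency
CLIENT_BUDGET_MS = 60      # batching and network stack delays
CLOUD_BUDGET_MS = 80       # aggregation after routing delays

# What is left for client + server once the network has taken its share.
processing_budget_ms = ROUND_TRIP_TARGET_MS - 2 * ONE_WAY_NETWORK_MS
allocated_ms = CLIENT_BUDGET_MS + CLOUD_BUDGET_MS
headroom_ms = processing_budget_ms - allocated_ms

print(processing_budget_ms, allocated_ms, headroom_ms)
```

Under that assumption the client and cloud allocations add up to less than the 200ms window, leaving some slack for jitter.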
On the client side we leveraged a lot of those TigerStyle ideas (static-allocation, simplified function surface area, explicit limits, strongly asserted). And for the cloud we effectively wrote Diagonal Scaling (Stage 7).
For gamedevs to use this, it had to be as lightweight as possible and introduce no risk of priority inversion or hitching. I wrote a many-reader/many-writer lockless circular queue with performance measured in the microseconds, it was a beast.
I was so proud of our tiny team for pulling off the vision. When we launched, there was a bug in the game: someone left in telemetry in a tight render loop during a cutscene, so our 20k/s client-side throttles kicked in.
We hit an average of 700k/s on launch day, with peak ingestion throughput of 831k/s (2 weeks to hit a trillion). And we didn't break a sweat.
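As a rough sanity check on the "2 weeks to hit a trillion" figure, here is a small sketch that converts a sustained event rate into days. Only the launch-day average and peak are given above; the sustained rate over the following weeks is not, so treating either number as constant is an assumption.

```python
SECONDS_PER_DAY = 86_400
TARGET_EVENTS = 1_000_000_000_000  # one trillion


def days_to_reach(target_events: int, events_per_second: float) -> float:
    """Days needed to ingest target_events at a constant rate."""
    return target_events / events_per_second / SECONDS_PER_DAY


days_at_average = days_to_reach(TARGET_EVENTS, 700_000)  # launch-day average
days_at_peak = days_to_reach(TARGET_EVENTS, 831_000)     # launch-day peak

print(f"{days_at_average:.1f} days at 700k/s, {days_at_peak:.1f} days at 831k/s")
```

Both results land in the neighborhood of two weeks, so the quoted numbers are consistent with each other to within rounding.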
Partner teams that were trying to provide our reporting capabilities started slowing down, though, haha, and brought it to our attention ("We, uh, can't scale anymore, there's no more servers on the East Coast data center we can allocate")... so we hit the killswitch on that category of event.
That was another piece I was proud of writing: each instrumentation point did a quick binary mask check for 64 categories and 64 subcategories to see if it should emit... one reason why the instrumentation times were so blazing fast: we had minimal branching based off a hotly-cached variable that would hang around L1 because it was touched so frequently.
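A minimal sketch of that kind of mask check, to make the killswitch idea concrete. The class, the method names and the two-mask layout are all hypothetical, not the original code, and Python integers only illustrate the logic, not the cache behavior of a fixed-width word.

```python
# Hypothetical category/subcategory emit gate: one 64-bit mask per level.
MASK_64 = (1 << 64) - 1


def _bit(index: int) -> int:
    """Return the single-bit mask for a category or subcategory index."""
    if not 0 <= index < 64:
        raise ValueError("index must be in 0..63")
    return 1 << index


class EmitGate:
    def __init__(self) -> None:
        # All 64 categories and 64 subcategories start enabled.
        self.category_mask = MASK_64
        self.subcategory_mask = MASK_64

    def disable_category(self, category: int) -> None:
        """The 'killswitch': clear one category bit."""
        self.category_mask &= ~_bit(category) & MASK_64

    def disable_subcategory(self, subcategory: int) -> None:
        self.subcategory_mask &= ~_bit(subcategory) & MASK_64

    def should_emit(self, category: int, subcategory: int) -> bool:
        """Two AND operations decide whether an instrumentation point fires."""
        return bool(self.category_mask & _bit(category)) and bool(
            self.subcategory_mask & _bit(subcategory)
        )


gate = EmitGate()
RENDER_LOOP = 7  # made-up category id for the noisy event
gate.disable_category(RENDER_LOOP)
print(gate.should_emit(RENDER_LOOP, 0), gate.should_emit(8, 0))
```

Turning a whole category off is a single bit flip, so the check at each instrumentation point stays the same cost whether events are enabled or not.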
Querying that day for insights, though... aggregation queries touching the launch day (that weren't on the per-user hotpath layer that was our primary use-case) would 30x query duration. XD
OH! And I forgot to mention. The tech was impressive enough that it was quickly adopted as the backbone by every title that Microsoft published, from Gears of War to Solitaire, Forza to Minecraft.
And for comparison: The official, initial Xbox One telemetry stack used pipe-delimited strings, and supported 1-5 transactions/console/second (vs our 20k).
Thanks! It was definitely a highlight working on such bleeding-edge technologies with crazy smart people. Poring over custom PowerPC CPU Errata was divine.
Oh, and we only had (if I recall) 1ms per frame on one core to do all our payload packaging and dequeue messages from the circular buffer. That's where the 20k/s hard limit came in... we could have handled SO much more. Our entire message usually landed around 100-150 bytes if I recall, using bitpacked structures.
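To get a feel for how tight that budget is, here is a small sketch of the per-message arithmetic. The frame rates are assumptions (the comment does not say what the game ran at); the 20k/s limit, the 1ms per frame and the 100-150 byte message size are the figures quoted above.

```python
MESSAGES_PER_SECOND = 20_000   # client-side hard limit
BUDGET_PER_FRAME_US = 1_000    # 1 ms per frame on one core
MESSAGE_BYTES = (100, 150)     # typical message size range


def per_frame_budget(frames_per_second: int) -> tuple[float, float]:
    """Return (messages per frame, microseconds available per message)."""
    messages_per_frame = MESSAGES_PER_SECOND / frames_per_second
    return messages_per_frame, BUDGET_PER_FRAME_US / messages_per_frame


for fps in (30, 60):  # assumed frame rates, not from the comment
    messages, microseconds = per_frame_budget(fps)
    print(f"{fps} fps: {messages:.0f} messages/frame, {microseconds:.1f} us each")

# Upstream bandwidth at the hard limit, in MB/s, for the small and large message sizes.
bandwidth_mb_per_s = tuple(MESSAGES_PER_SECOND * size / 1e6 for size in MESSAGE_BYTES)
print(bandwidth_mb_per_s)
```

At either assumed frame rate the packaging work has only a few microseconds per message, which fits the description of a queue measured in microseconds.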
One thing I didn't anticipate: Memory stomping would result in everyone pointing fingers at our department, because we would inevitably be the ones that would crash (usually with our hardening asserts). We had to start flagging our memory blocks as unwritable when our thread was idle during debug mode, so that offenders would crash when they touched our memory.
I think I need a deeper dive into the "diagonal scaling" presented. From my understanding, this is actually no different from the "industry decoupling" he disparages earlier in the presentation. There are even off-the-shelf libraries for LSMs backed by object storage, like SlateDB.
The author of SlateDB, Chris Riccomini, is an angel investor in TigerBeetle.
However, TB here is also providing a Replicated State Machine with consensus and strict serializability, in front of object storage, to provide remote object storage capacity and recovery, but with local NVMe latency and without sacrificing consistency or durability.
TB navigates the entire design space, specializing for both hot and cold (transactional) data.
The more you zoom in, the stronger the set of guarantees in terms of safety and performance.
I feel the Expression Problem neatly frames the "diagonal scaling" proposition; what system design choices will allow the architecture to scale vertically in what fashion, while also being able to scale what horizontally, without losing strict serialisability.
If we add a "vertical" capability, it cannot be at the cost of any existing "horizontal" capability, nor should doing so forfend any future "horizontal" capability. And vice-versa (adding horizontal capability should not mess with vertical ones). The point at which one will break the other is the theoretical design limit of the system.
in general these aren't in conflict. in particular once I have a system which can distribute work among faulty nodes and maintain serializability, exploiting parallelism _within_ a fault domain just falls out.
This was a team effort: the object storage connector, the scale test, the visualization, the slides, even provisioning the hardware had its challenges!
Oh most certainly; I am remiss to have not included the group effort in my comment, particularly as a person surrounded by theatre and film making friends.
Still, for the same reason, I have some idea of why their productions turn out well (or not). Where "well" is "a story well told", not "successful" as in "did well at the box office". The why is usually one person who keeps asking the questions and making the decisions that take the story from imagination to imagination via screen or floor.
Something tells me your doubtlessly excellent "production team" (in film terms) will agree with my original comment :)
This looks like yet another basic key value store.
Benchmarking is a complicated problem, but FoundationDB claims 300,000 reads per second on a single core. TigerBeetle claims 100k-500k TPS on... Some kind of hardware?
I have tried to use tiger beetle in production. haven't been successful yet.
nice stuff, multi master replication.
user API, super small.
doubts about how to do streaming backup.
after studying the API and doing some spike architectures I come to the conclusion (I may be wrong):
tiger beetle is awesome to keep the account balance. that's it.
because you pretty much get the transactions affecting an account and IIRC there was not a lot you can do about how to query them or use them.
also I was thinking it would be nice to have something like an account grouping other accounts to answer something like: how much money our user accounts have in this microsecond?
I think that was more or less about it. they have some special u128 fields to store ids of the transactions they represent in your actual system
and IIRC they handle multi currency in different books
my conclusion was: I think I don't get it yet. I think I'm missing something. had to write a ruby client for it and build a UI to play with the API and do some transactions and see how it behaved. yet that was my conclusion
> how much money our user accounts have in this microsecond
My understanding is that if you want aggregations or sub accounts then you need to duplicate transactions and maintain them yourself. This may seem like it would be annoying, but I suspect it would mostly be a matter of code organization.
I expect typical TigerBeetle (TB) clients will develop and maintain an application-level library that builds up a set of TB transactions that encode each business-level transaction. For example, if a "business transaction" is "marking a purchase order as received", the corresponding set of TB transactions might include: 1. moving the received qty from the pending receipt qty account to the inventory qty account for the received line items, 2. adding the total cost to the inventory value account for this item, 3. adding the price to the Accounts Payable (AP) account for this vendor, 4. adding the shipping price to the AP account for the delivery company, etc. But then you might want some aggregations, so you'd do the same thing again and add the price to the "total inventory value" and "total accounts payable" accounts, etc.
In fact you might want 3, 4 or even more parallel ledgers at different levels of aggregation, which could all be maintained within the application library. I wonder if there's a name for this technique. My only concern is that if you break down your business transaction into fine grained detail like this and then duplicate it with multiple aggregations then that 8000 transaction limit starts looking a lot smaller.
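A rough sketch of what such an application-level library could look like, to show how quickly the transfer count grows. Everything here is hypothetical: the `Transfer` class, the account names and the `with_aggregates` helper are made up for illustration and are not the TigerBeetle client API. The 8000 limit is the figure quoted above and has not been checked against the docs. Debiting the shipping cost to the inventory value account is also an assumption.

```python
from dataclasses import dataclass

BATCH_LIMIT = 8000  # figure quoted in the comment, not verified


@dataclass(frozen=True)
class Transfer:
    ledger: str
    debit_account: str
    credit_account: str
    amount: int


def with_aggregates(detail, rollups):
    """Return the detail transfers plus mirror transfers for each aggregate ledger.

    `rollups` maps an aggregate ledger name to a dict that maps each detail
    account to its aggregate account. A transfer is mirrored only when both
    of its accounts have an aggregate at that level.
    """
    batch = list(detail)
    for aggregate_ledger, account_map in rollups.items():
        for t in detail:
            if t.debit_account in account_map and t.credit_account in account_map:
                batch.append(
                    Transfer(
                        aggregate_ledger,
                        account_map[t.debit_account],
                        account_map[t.credit_account],
                        t.amount,
                    )
                )
    if len(batch) > BATCH_LIMIT:
        raise ValueError(f"{len(batch)} transfers exceed the batch limit")
    return batch


# "Mark a purchase order as received" for one line item (made-up names and amounts).
detail = [
    Transfer("qty", "inventory_qty:widget", "pending_receipt_qty:widget", 10),
    Transfer("usd", "inventory_value:widget", "ap:acme", 500),
    Transfer("usd", "inventory_value:widget", "ap:fastfreight", 40),
]
rollups = {
    "usd_totals": {
        "inventory_value:widget": "total_inventory_value",
        "ap:acme": "total_accounts_payable",
        "ap:fastfreight": "total_accounts_payable",
    },
}
batch = with_aggregates(detail, rollups)
print(len(detail), "detail transfers ->", len(batch), "transfers in the batch")
```

Each extra aggregation level adds up to one more transfer per detail transfer, which is why the batch limit shrinks quickly in business terms.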
To a first approximation, yes. But, why? And for up to how many hundred terabytes of data can you get away with the single beefy server? Provided you make what design choices?
> how many hundred terabytes of data can you get away with the single beefy server
You can have multiple petabytes on a single beefy server with excellent performance characteristics. This is already a thing.
The part of the database that doesn't scale with storage density is cache replacement algorithms, as a matter of theory. There are some problems a cache can't solve. These degrade long before you get to petabytes. If you replace cache replacement architectures with cache admission architectures then things scale just fine.
The main reason we use cache replacement architectures is that they are simple to design and understand. They do have fundamental limits though. Right tool for the job and all that.
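One way to see the difference is a toy comparison between a plain LRU cache (replacement only) and the same LRU with an admission rule in front of it. This is just one possible reading of "cache admission": the comment does not describe a specific design, and the "admit on second miss" rule and all names below are assumptions chosen for illustration.

```python
from collections import OrderedDict


class LruCache:
    """Replacement only: every miss is inserted, evicting the least recently used key."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key) -> bool:
        """Return True on a hit; on a miss, let the policy decide what to do."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        self._on_miss(key)
        return False

    def _on_miss(self, key) -> None:
        self._insert(key)

    def _insert(self, key) -> None:
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict least recently used
        self.items[key] = True


class AdmissionCache(LruCache):
    """Same LRU, but a key is admitted only on its second miss."""

    def __init__(self, capacity: int) -> None:
        super().__init__(capacity)
        self.seen_once = set()  # unbounded here; a real filter would be bounded

    def _on_miss(self, key) -> None:
        if key in self.seen_once:
            self.seen_once.discard(key)
            self._insert(key)
        else:
            self.seen_once.add(key)


def hits_after_scan(cache: LruCache) -> int:
    """Warm a small hot set, run a one-off scan, then count surviving hot keys."""
    hot = ["a", "b", "c"]
    for _ in range(2):  # two passes so both policies end up holding the hot set
        for key in hot:
            cache.get(key)
    for i in range(100):  # scan of keys that are never touched again
        cache.get(f"scan-{i}")
    return sum(cache.get(key) for key in hot)


print(hits_after_scan(LruCache(3)), hits_after_scan(AdmissionCache(3)))
```

In this toy run the scan flushes the hot set out of the plain LRU, while the admission rule keeps one-off keys from ever entering the cache.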
To keep things simple. My current company is running multiple instances of back-end services for absolutely no fucking reason, and I had to fix numerous race condition bugs for them. I had an interview with a startup where, after I asked why they were using distributed DynamoDB locks in a monolith app with only a single instance running, the person said "it works for us" and got defensive. Later they told me I wasn't experienced enough. I am so frustrated that there appears to be zero basic engineering rigor anywhere I can find nowadays.
> And for up to how many hundred terabytes of data can you get away with the single beefy server?
Do you even need to store many hundred terabytes of data? I have never encountered a scenario in my career (admittedly not very long so far) where there was a need to store even one terabyte of data. But in the case of TigerBeetle, from skimming through the video, it appears they offload the main bulk of data to a "remote storage."
If you think that my opinion is not worth listening to, i.e. that I am wrong, would you mind elaborating why? There is a real opportunity to sway my opinion here, because I am not sure. I could just be crazy. I don't know anymore. But, generally, I don't think that you necessarily have to have multiple back-end instances, and that if you have multiple back-end instances, that you will necessarily have race conditions. Am I wrong in this?
Well, your opinion doesn't consider many real world factors. (It's also worth noting that you toned down your response to me with hedging language... "generally", "necessarily")
No matter how good a particular server is, it isn't immune to power outages, fiber cuts, fires, basic hardware failure or even just downtime for basic updates (OS, application deployment, etc).
As soon as there is any sort of cost with downtime (direct or indirect monetary cost, or reputational cost), basic engineering rigor requires that you use redundancy to handle such failures, and that means spending money (in the form of vendors or engineering) on it. If money and time is being spent to create an application, and there is a reasonable assumption that downtime to that app will have costs to it, one way to amortize the cost of retrofitting redundancy into an app is to start with it.
Having multiple instances of the backend also allows for other cost saving measures:
* N instances of a smaller server may be cheaper than 1 instance of a really good server
* Multiple instances of the backend allow for update deployments while the app is live, indirectly driving other cost savings (no overtime or pager pay, happier employees... "they don't give us snacks but we never have to work late", lower costs associated with downtime due to botched deployments)
* It's cheaper to hire engineers that follow this de facto standard pattern than to sit down and pave new ground with other tools, and using that pattern they will achieve the desired result in terms of reliability and uptime.
* It allows for scaling if your app's traffic is seasonal, meaning you don't need to spend as much on resources as you scale (note this is the first time scaling has been brought up, and it's as a minor point).
Does every app have these concerns? Of course not. Do a very large number of apps have these concerns, or have an expected value calculation around these concerns that says the smart money is on planning for them? Yes.
In the context of a discussion of hardware being so good that a single bit of hardware can handle any load you're likely to throw at it - declaring multiple instances to be pointless is an analysis that hasn't considered any of the factors I brought up, in terms of a real cost analysis. Particularly if the complaint doesn't even introduce the concept of uptime as irrelevant to the app.
So, generally, is it "necessary" to have multiple instances of a backend? I don't know; "generally necessary" is a very narrow scope. Is it bad to dismiss multiple backends as probably not needed? Yes, and it's equally foolish to immediately require them. Proper engineering rigor requires considering not just the technical but also the financial realities of an application.
About race conditions... will multiple backends necessarily end up with them? No, in the same way that a lottery ticket won't necessarily be a loser. But multiple backends aren't required for race conditions... a single backend will presumably have concurrency within each instance too. And where there is concurrency, there almost certainly will be race conditions; I've rarely seen software that doesn't have them at some point or another. Maybe TigerBeetle doesn't and never will, but that is a unicorn team working on a very narrowly defined bit of software that is merely a component of other systems, working under conditions that are extremely expensive to reproduce for most engineering projects. The general case is that you will write, deploy and be frustrated by race conditions since the cost-benefit analysis doesn't call for absolute perfection... I know I have, and odds are you've read about the consequences of it on this site at some point.
The point of all of this is that engineering rigor goes beyond technical rigor. It includes understanding the tradeoffs in terms of budget, uptime, technical decision making, speed of iteration and so on.
Boot it up again. You'll still have higher availability than AWS, GitHub, OpenAI, Anthropic, and many others.
> Where do you think those object storage live exactly?
On a RAID5 array with hot-swappable disks, of course.
(Edit to add: this is just a comment on Kubernetes being invoked whenever someone talks about scalability; I have massive respect for what the TigerBeetle folks are doing)
> (Edit to add: this is just a comment on Kubernetes being invoked whenever someone talks about scalability; I have massive respect for what the TigerBeetle folks are doing)
Me too. Why did you have to add this edit though? Is there anything that suggests either of us disrespects the TigerBeetle folks? I swear, I'm going crazy.
[1] https://codeberg.org/thomas-fractalbits/iofthetiger
[2] https://fractalbits.com/blog/why-we-built-another-object-sto...
https://apple.github.io/foundationdb/benchmarking.html
would be great to have an official UI client
https://docs.tigerbeetle.com/operating/cdc/
Which leads to the real takeaway which is "Tiger Style": https://tigerstyle.dev/ which I am partial to, along with Rich Hickey's "Hammock Driven Development" https://www.youtube.com/watch?v=f84n5oFoZBc
"Tiger on Hammock" will absolutely smoke the competition.
(edit: add links)
You'd still be fixing race conditions though.
Kubernetes is not just for scaling, it's a way to standardize all ops.