I rebuilt FlashAttention in Triton to understand the performance archaeology

(aminediro.com)

44 points | by amindiro 3 days ago

4 comments

amindiro 3 days ago
I’ve spent the last few weeks deconstructing FlashAttention. While the original paper is brilliant, I found that just reading it didn't give me a "gut feeling" for why certain engineering choices were made, such as those behind the transition from v1 to v2.
I decided to rebuild it from scratch using Triton. This post is a chronicle of that journey, moving beyond the high-level algorithm and into the "performance archaeology" of the GPU:
- Profiling with Nsight Compute to find the real bottlenecks.
- Looking at the generated PTX and SASS code.
- Debugging shared memory bank conflicts and MIO bottlenecks.
- Iterating through the logic to see why tiling and online softmax are hardware-necessitated, not just mathematical tricks.
I’ve tried to keep it in the spirit of Simon Boehm’s matmul deep dive. Would love to hear from any GPU engineers on whether my interpretations of the SASS/bank conflict behavior match what you've seen in production.
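For readers who haven't seen the online softmax mentioned above, here is a minimal pure-Python sketch for a single query row. It is not the Triton kernel from the post. The function names, the `tile_size` parameter, and the assumption that all scores are finite are illustrative choices. It shows the core trick: keep a running max and a running sum, and rescale the accumulator whenever a new tile raises the max, so the full score row never has to be held at once.

```python
import math


def attention_row_reference(scores, values):
    """Plain softmax(scores) @ values with the whole row available at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


def attention_row_online(scores, values, tile_size):
    """Same result, but keys/values are visited one tile at a time.

    Assumes every score is finite (hypothetical simplification: no masking).
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])

    for start in range(0, len(scores), tile_size):
        s_tile = scores[start:start + tile_size]
        v_tile = values[start:start + tile_size]

        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated under the old max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]

        # Fold the current tile into the running sum and the output accumulator.
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]

        running_max = new_max

    # Normalize once at the end.
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    scores = [0.5, -1.0, 3.0, 2.0, 0.0]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
    print(attention_row_reference(scores, values))
    print(attention_row_online(scores, values, tile_size=2))
```

The tiled version should match the reference up to floating-point rounding for any tile size. On a GPU the same bookkeeping is what lets each tile of K and V live in on-chip memory while it is being processed.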
rishabhaiover 26 minutes ago
I ran an experiment on FlashAttention in Triton to measure the impact of caching tiles in shared memory. Surprisingly, the effect of prefetching these tiles was non-monotonic and kernel-dependent: the attention kernel benefits from prefetching, while MLP W1 doesn't.
raphaelty 1 hour ago
Very interesting. I wonder whether there are other heavily used algorithms that could benefit a lot from a "Flash" version but don't have one today.
npalli 1 hour ago
Seems very detailed and comprehensive. Maybe I missed it, but was there a performance comparison to the PyTorch version at the top?