Small things I'm building — visualizations of ideas I find interesting, interactive demos, and prototypes that aren't quite anything yet.
A Vision Transformer (ViT) cuts an image into fixed-size patches, turns each patch into a token, and runs the resulting sequence through a standard transformer, with no convolutions involved. The diagram below shows every stage at once. Step through the narrative, or hover over any patch (or any bar) to trace one token's path through the whole pipeline.
Hover or tap any patch / bar / cell to trace one token's data through every stage. Click to pin.
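For readers who prefer code to diagrams, the image-to-tokens step can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's implementation: the names `patchify` and `embed`, the 224×224 input, the 16-pixel patches, the 768-wide embedding, and the random weights are all assumptions made for the example, and the class token is left out to keep it short.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with patches
    ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(patches: np.ndarray, weight: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Linear projection of each flattened patch plus a position embedding."""
    return patches @ weight + pos

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, 16)                         # (196, 768)
weight = rng.normal(scale=0.02, size=(768, 768))      # (patch_dim, D)
pos = rng.normal(scale=0.02, size=(196, 768))         # one row per token
tokens = embed(patches, weight, pos)                  # (196, 768)
print(patches.shape, tokens.shape)
```

Each row of `tokens` corresponds to one patch in the diagram, which is why a single token can be traced from its patch through every later stage.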
Back-of-envelope total FLOPs and parameter count for a Vision Transformer given its architecture and input shape. Pick a standard preset, or tweak any value to see how the numbers shift.
FLOPs are counted as multiply-accumulates (one multiply plus one add counts as a single operation), which matches the convention in the ViT and DeiT papers. LayerNorm, softmax, GELU, and biases are dropped as negligible next to the matmuls. The head count is informational only: each of the h heads does attention at width dₖ, so the total is the same as long as D = h · dₖ.
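The counting rules above can be sketched as a small Python calculator. This is a back-of-envelope sketch under stated assumptions, not the code behind the widget: the function names `vit_macs` and `vit_params`, the default ViT-B/16-style values, the extra class token, and the classification head are assumptions for the example. The parameter count here also includes biases and LayerNorm weights, which is an assumption about what "parameter count" covers; the MAC count ignores them, as described.

```python
def vit_macs(image_size=224, patch=16, channels=3, dim=768, depth=12,
             heads=12, mlp_dim=3072, num_classes=1000, cls_token=True):
    """Multiply-accumulate count for one forward pass (matmuls only)."""
    n_patches = (image_size // patch) ** 2
    n = n_patches + (1 if cls_token else 0)   # sequence length
    d_k = dim // heads                        # per-head width, D = h * d_k

    patch_embed = n_patches * (patch * patch * channels) * dim
    qkv = 3 * n * dim * dim                   # Q, K, V projections
    scores = heads * n * n * d_k              # Q K^T, summed over heads
    weighted = heads * n * n * d_k            # attention weights times V
    out_proj = n * dim * dim                  # attention output projection
    mlp = 2 * n * dim * mlp_dim               # two MLP matmuls
    per_layer = qkv + scores + weighted + out_proj + mlp
    head = dim * num_classes                  # classifier on one token
    return patch_embed + depth * per_layer + head

def vit_params(image_size=224, patch=16, channels=3, dim=768, depth=12,
               mlp_dim=3072, num_classes=1000, cls_token=True):
    """Parameter count, including biases and LayerNorm (an assumption)."""
    n = (image_size // patch) ** 2 + (1 if cls_token else 0)
    patch_embed = patch * patch * channels * dim + dim
    tokens = (dim if cls_token else 0) + n * dim       # class token + positions
    attn = 4 * dim * dim + 4 * dim                     # Q, K, V, output
    mlp = 2 * dim * mlp_dim + mlp_dim + dim
    norms = 2 * 2 * dim                                # two LayerNorms per layer
    per_layer = attn + mlp + norms
    final = 2 * dim + dim * num_classes + num_classes  # last norm + head
    return patch_embed + tokens + depth * per_layer + final

macs = vit_macs()
params = vit_params()
print(f"{macs / 1e9:.2f} GMACs, {params / 1e6:.1f} M parameters")
```

Calling `vit_macs(heads=6)` returns the same total as the default `heads=12`, since the two attention terms reduce to N² · D whenever D = h · dₖ.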