Linda Wang
2026 Edition
STUDIO —SKETCHES
& PROTOTYPES

Small things I'm building — visualizations of ideas I find interesting, interactive demos, and prototypes that aren't quite anything yet.

01 —VIT,
EXPLORED

A Vision Transformer (ViT) turns an image into a sequence of tokens, then runs it through a standard transformer — no convolutions involved. The diagram below shows every stage at once. Step through the narrative, or hover any patch (or any bar) to trace one token's path through the whole pipeline.

Image · H × W × 3
→ flatten + project + pos enc
Embedding X · (N+1) × D
→ × W_Q, W_K, W_V
Q (queries) · K (keys) · V (values)
→ softmax(Q·Kᵀ / √dₖ)
Attention A · (N+1) × (N+1)
→ A · V
Attn out
→ feed-forward MLP · Linear (D → 4D) → GELU → Linear (4D → D)
Block out · (N+1) × D
× N layers
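
The stages listed above can be traced as array shapes in a few lines of NumPy. This is a minimal sketch, not the code behind the demo: the patch size `P`, the random weights, the single attention head (so dₖ = D), and the helper names `patchify` and `trace_pipeline` are all assumptions made for illustration. It follows this diagram literally, so LayerNorm and the residual connections are left out.

```python
import numpy as np

def patchify(img, P):
    """Split an H x W x C image into N = (H/P)*(W/P) flattened P x P patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    x = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def trace_pipeline(H=32, W=32, P=8, D=16, seed=0):
    """Run one token sequence through every stage with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    img = rng.random((H, W, 3))

    patches = patchify(img, P)                      # (N, P*P*3)
    N = patches.shape[0]
    W_embed = rng.normal(0.0, 0.02, (P * P * 3, D))
    cls = rng.normal(0.0, 0.02, (1, D))             # [CLS] token
    pos = rng.normal(0.0, 0.02, (N + 1, D))         # positional encoding
    X = np.concatenate([cls, patches @ W_embed], axis=0) + pos   # (N+1, D)

    W_q, W_k, W_v = (rng.normal(0.0, 0.02, (D, D)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # each (N+1, D)
    A = softmax(Q @ K.T / np.sqrt(D))               # (N+1, N+1), single head
    attn_out = A @ V                                # (N+1, D)

    W1 = rng.normal(0.0, 0.02, (D, 4 * D))
    W2 = rng.normal(0.0, 0.02, (4 * D, D))
    block_out = gelu(attn_out @ W1) @ W2            # (N+1, D)
    return {"X": X, "A": A, "attn_out": attn_out, "block_out": block_out}

stages = trace_pipeline()
for name, arr in stages.items():
    print(name, arr.shape)
```

Each row of `A` sums to 1, so every output token is a weighted average of the value vectors of all N+1 tokens.
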
Image · H × W × 3
→ linear projection · + positional encoding · prepend [CLS]
Patch + position embeddings · sequence X · (N+1) × D
Transformer encoder block (× N blocks)
  LayerNorm → Multi-Head Attention (h heads · each runs Q·K·V → softmax → A·V) → + residual
  LayerNorm → MLP (Linear → GELU → Linear) → + residual
Contextualized sequence · [CLS] + N patch tokens
[CLS] token readout · LayerNorm · MLP head (Linear → softmax over classes)
Class probabilities
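
The encoder block and the [CLS] readout can be sketched the same way. This is an illustrative NumPy version under stated assumptions, not the demo's implementation: weights are random, biases and the learned LayerNorm scale and shift are omitted, and the output projection `Wo` after the heads are merged is a standard component that the diagram does not draw. The names `init_block`, `encoder_block`, and `vit_readout` are hypothetical, and `depth` stands for the number of blocks so it does not collide with N, the number of patches.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its D features (learned scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(x, p, h):
    T, D = x.shape
    d_k = D // h
    def split(t):
        # (T, D) -> (h, T, d_k): each head sees its own slice of the features
        return t.reshape(T, h, d_k).transpose(1, 0, 2)
    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    A = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))   # (h, T, T)
    merged = (A @ v).transpose(1, 0, 2).reshape(T, D)      # concatenate heads
    return merged @ p["Wo"]                                # assumed output projection

def encoder_block(x, p, h):
    # Pre-LayerNorm ordering with two residual additions, as in the diagram.
    x = x + multi_head_attention(layer_norm(x), p, h)
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]
    return x

def init_block(rng, D):
    shapes = {"Wq": (D, D), "Wk": (D, D), "Wv": (D, D), "Wo": (D, D),
              "W1": (D, 4 * D), "W2": (4 * D, D)}
    return {name: rng.normal(0.0, 0.02, s) for name, s in shapes.items()}

def vit_readout(X, blocks, W_head, h):
    for p in blocks:
        X = encoder_block(X, p, h)
    cls = layer_norm(X)[0]           # row 0 is the [CLS] token
    return softmax(cls @ W_head)     # class probabilities

rng = np.random.default_rng(0)
D, h, depth, num_classes = 16, 4, 2, 10
X = rng.normal(size=(17, D))         # 16 patch tokens + [CLS]
blocks = [init_block(rng, D) for _ in range(depth)]
W_head = rng.normal(0.0, 0.02, (D, num_classes))
probs = vit_readout(X, blocks, W_head, h)
print(probs.shape, probs.sum())
```

Only the [CLS] row reaches the classifier; the patch tokens influence the prediction through attention inside the blocks.
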


02 —VIT FLOPS
CALCULATOR

Back-of-envelope total FLOPs and parameter count for a Vision Transformer given its architecture and input shape. Pick a standard preset, or tweak any value to see how the numbers shift.

Calculator outputs: sequence length L · patch embedding · attention (×N) · MLP (×N) · parameters · FLOPs / forward

FLOPs are counted as multiply-accumulates (MACs), which matches the convention in the ViT and DeiT papers. LayerNorm, softmax, GELU, and biases are dropped as negligible next to the matmuls. The head count is informational only: total FLOPs are the same for any h as long as D = h · dₖ.
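
The counting convention above can be written out as a short Python sketch. It is an independent reconstruction, not the calculator's source: the `ViTConfig` fields and function names are hypothetical, the defaults assume a ViT-B/16-style configuration at 224 × 224, the attention count assumes an output projection after the heads, and the parameter count assumes biases and LayerNorm weights are included (the note above drops them only from the FLOPs). The classification head is left out of the FLOPs total to match the three categories listed.

```python
from dataclasses import dataclass, replace

@dataclass
class ViTConfig:
    image_h: int = 224
    image_w: int = 224
    patch: int = 16
    channels: int = 3
    dim: int = 768          # D
    depth: int = 12         # number of encoder blocks
    heads: int = 12         # h
    mlp_dim: int = 3072     # hidden width of the MLP (4D here)
    num_classes: int = 1000

def vit_macs(cfg):
    """Multiply-accumulates per forward pass, matmuls only."""
    assert cfg.image_h % cfg.patch == 0 and cfg.image_w % cfg.patch == 0
    assert cfg.dim % cfg.heads == 0
    n_patches = (cfg.image_h // cfg.patch) * (cfg.image_w // cfg.patch)
    L = n_patches + 1                       # +1 for [CLS]
    d_k = cfg.dim // cfg.heads

    patch_embed = n_patches * (cfg.patch * cfg.patch * cfg.channels) * cfg.dim
    projections = 4 * L * cfg.dim * cfg.dim           # W_Q, W_K, W_V, output proj
    scores_and_mix = cfg.heads * 2 * L * L * d_k      # Q·Kᵀ and A·V, per head
    attention = cfg.depth * (projections + scores_and_mix)
    mlp = cfg.depth * 2 * L * cfg.dim * cfg.mlp_dim   # D -> mlp_dim -> D
    return {"seq_len": L, "patch_embed": patch_embed, "attention": attention,
            "mlp": mlp, "total": patch_embed + attention + mlp}

def vit_params(cfg):
    """Parameter count, assuming biases and LayerNorm weights are kept."""
    D = cfg.dim
    L = (cfg.image_h // cfg.patch) * (cfg.image_w // cfg.patch) + 1
    embed = (cfg.patch * cfg.patch * cfg.channels) * D + D   # projection + bias
    tokens = D + L * D                                       # [CLS] + positions
    attn = 4 * D * D + 4 * D
    mlp = 2 * D * cfg.mlp_dim + cfg.mlp_dim + D
    norms = 2 * 2 * D                                        # two LayerNorms per block
    head = 2 * D + D * cfg.num_classes + cfg.num_classes     # final LayerNorm + linear
    return embed + tokens + cfg.depth * (attn + mlp + norms) + head

base = ViTConfig()
macs = vit_macs(base)
print(f"L = {macs['seq_len']}, total = {macs['total'] / 1e9:.2f} GMACs")
print(f"params = {vit_params(base) / 1e6:.1f} M")

# Changing the head count leaves the total unchanged, because h * d_k = D.
assert vit_macs(replace(base, heads=6))["total"] == macs["total"]
```

Because `scores_and_mix` multiplies `heads` by `d_k`, the product always collapses back to D, which is why the head count does not move the total.
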