Studio — Linda Wang

STUDIO — SKETCHES & PROTOTYPES

Small things I'm building — visualizations of ideas I find interesting, interactive demos, and prototypes that aren't quite anything yet.

01 — VIT, EXPLORED

A Vision Transformer (ViT) turns an image into a sequence of tokens, then runs it through a standard transformer — no convolutions involved. The diagram below shows every stage at once. Step through the narrative, or hover any patch (or any bar) to trace one token's path through the whole pipeline.
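The image-to-tokens step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code behind the diagram: the function name `patch_embed`, the toy sizes, and the random weights are all assumptions, and it follows the common convention of prepending [CLS] first and then adding a position embedding to all N+1 tokens.

```python
import numpy as np

def patch_embed(image, patch, w_proj, cls_token, pos_emb):
    """Turn an H x W x 3 image into an (N+1) x D token sequence."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch
    # Cut the image into a gh x gw grid of patches, then flatten each patch.
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(gh * gw, patch * patch * c)             # N x (P*P*3)
    tokens = flat @ w_proj                                         # N x D
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N+1) x D
    return tokens + pos_emb                                        # add positions

# Hypothetical toy sizes: a 32 x 32 image with 8 x 8 patches gives N = 16.
rng = np.random.default_rng(0)
H = W = 32
P, D = 8, 16
N = (H // P) * (W // P)
image = rng.random((H, W, 3))
X = patch_embed(
    image, P,
    w_proj=rng.normal(size=(P * P * 3, D)),
    cls_token=np.zeros(D),
    pos_emb=rng.normal(size=(N + 1, D)),
)
print(X.shape)  # (17, 16)
```

No convolution appears anywhere: the whole step is a reshape followed by one matrix multiply.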

Image · H × W × 3
→ flatten + project + pos enc
Embedding X · (N+1) × D
→ × W_Q, W_K, W_V
Q (queries) · K (keys) · V (values)
→ softmax(Q·K^T/√d_k)
Attention A · (N+1) × (N+1)
→ A · V
Attn out · A · V
→ feed-forward
MLP · Linear (D → 4D) → GELU → Linear (4D → D)
→
Block out · (N+1) × D

× N LAYERS
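The stages in this pipeline map almost one-to-one onto matrix operations. The sketch below traces a single head through Q/K/V, the attention matrix A, A · V, and the feed-forward MLP, exactly as drawn (so without LayerNorm or residuals). The function names, toy sizes, random weights, and the tanh approximation of GELU are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (an assumption; exact GELU uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))  # (N+1) x (N+1), each row sums to 1
    return A, A @ V

def mlp(X, W1, W2):
    return gelu(X @ W1) @ W2             # D -> 4D -> D

rng = np.random.default_rng(0)
T, D = 17, 16                            # hypothetical: 16 patches + [CLS]
X = rng.normal(size=(T, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
A, attn_out = attention(X, Wq, Wk, Wv)
block_out = mlp(attn_out,
                rng.normal(size=(D, 4 * D)) * 0.1,
                rng.normal(size=(4 * D, D)) * 0.1)
print(A.shape, attn_out.shape, block_out.shape)  # (17, 17) (17, 16) (17, 16)
```

Row i of A is the set of weights token i places on every token, which is what hovering a single patch traces in the diagram.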

Image · H × W × 3
↓ linear projection · + positional encoding · prepend [CLS]
Patch + position embeddings · sequence X · (N+1) × D
↓
LayerNorm
↓
Multi-Head Attention · h heads, each runs Q·K·V → softmax → A·V
↓
+ (residual add)
↓
LayerNorm
↓
MLP · Linear → GELU → Linear
↓
+ (residual add)
× N blocks
↓
Contextualized sequence · [CLS] + N patch tokens
↓
[CLS] token readout
↓
LayerNorm · MLP head · Linear → softmax over classes
↓
Class probabilities
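The full stack above (LayerNorm before each sub-layer, a residual add after it, then a [CLS] readout) can be sketched as follows. This is an illustrative sketch under assumptions, not the demo's implementation: the output projection `wo` is the usual multi-head convention but is not drawn in the diagram, LayerNorm here has no learned scale or shift, and every name, size, and weight is hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize each token over its D features; learned scale/shift omitted.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_attention(x, wq, wk, wv, wo, heads):
    t, d = x.shape
    d_k = d // heads

    def split(m):  # (T, D) -> (h, T, d_k)
        return m.reshape(t, heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, T, T)
    merged = (a @ v).transpose(1, 0, 2).reshape(t, d)     # concatenate heads
    return merged @ wo

def encoder_block(x, p, heads):
    # Pre-norm: LayerNorm -> sub-layer -> residual add, twice.
    x = x + multi_head_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"], heads)
    x = x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]
    return x

def classify(x, blocks, w_head, heads):
    for p in blocks:
        x = encoder_block(x, p, heads)
    cls = layer_norm(x)[0]           # read out the [CLS] token (row 0)
    return softmax(cls @ w_head)     # class probabilities

rng = np.random.default_rng(0)
T, D, HEADS, LAYERS, CLASSES = 17, 16, 4, 2, 10  # hypothetical toy sizes

def init(shape):
    return rng.normal(size=shape) * 0.1

blocks = [
    {"wq": init((D, D)), "wk": init((D, D)), "wv": init((D, D)), "wo": init((D, D)),
     "w1": init((D, 4 * D)), "w2": init((4 * D, D))}
    for _ in range(LAYERS)
]
probs = classify(init((T, D)), blocks, init((D, CLASSES)), HEADS)
print(probs.shape, round(float(probs.sum()), 6))  # (10,) 1.0
```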


Hover or tap any patch / bar / cell to trace one token's data through every stage. Click to pin.

02 — VIT FLOPS CALCULATOR

Back-of-envelope total FLOPs and parameter count for a Vision Transformer given its architecture and input shape. Pick a standard preset, or tweak any value to see how the numbers shift.

Inputs: Preset · Image H × Image W · Patch P · Embed D · Layers N · Heads h · MLP ratio

Outputs: Sequence length L · Patch embedding · Attention (×N) · MLP (×N) · Parameters · FLOPs / forward

FLOPs are counted as multiply-accumulates (one MAC counts as one FLOP), matching the convention in the ViT and DeiT papers. LayerNorm, softmax, GELU, and biases are dropped as negligible next to the matmuls. Heads is informational only: for a fixed D, total FLOPs do not depend on h, because D = h · dₖ.
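The same back-of-envelope arithmetic fits in a short function. This is a sketch of one reasonable accounting, not necessarily the exact formulas behind the calculator. It assumes that the sequence length L includes the [CLS] token, that attention includes the output projection, and that the parameter count (unlike the FLOPs) includes biases, LayerNorm weights, and position embeddings. The function name `vit_cost` and the optional `num_classes` argument are hypothetical.

```python
def vit_cost(img_h, img_w, patch, dim, layers, heads, mlp_ratio=4, num_classes=0):
    """Rough MAC and parameter counts for a ViT (matmuls only for MACs)."""
    assert img_h % patch == 0 and img_w % patch == 0, "patches must tile the image"
    assert dim % heads == 0, "D must equal h * d_k"
    n_patches = (img_h // patch) * (img_w // patch)
    seq = n_patches + 1                  # assumed: L counts the [CLS] token
    d_k = dim // heads
    hidden = int(mlp_ratio * dim)

    patch_embed = n_patches * (patch * patch * 3) * dim
    projections = 4 * seq * dim * dim            # W_Q, W_K, W_V, and output projection
    per_head = 2 * seq * seq * d_k               # Q.K^T plus A.V for one head
    attention = layers * (projections + heads * per_head)
    mlp = layers * 2 * seq * dim * hidden        # D -> hidden -> D

    per_block = (4 * (dim * dim + dim)           # attention weights and biases
                 + (dim * hidden + hidden)       # first MLP linear
                 + (hidden * dim + dim)          # second MLP linear
                 + 4 * dim)                      # two LayerNorms (scale and shift)
    params = ((patch * patch * 3 * dim + dim)    # patch projection
              + dim                              # [CLS] token
              + seq * dim                        # position embeddings
              + layers * per_block
              + 2 * dim                          # final LayerNorm
              + num_classes * (dim + 1))         # optional classifier head
    return {"seq_len": seq, "patch_embed": patch_embed, "attention": attention,
            "mlp": mlp, "total": patch_embed + attention + mlp, "params": params}

# ViT-B/16 at 224 x 224: D = 768, 12 layers, 12 heads, MLP ratio 4.
cost = vit_cost(224, 224, 16, 768, 12, 12, mlp_ratio=4, num_classes=1000)
print(cost["seq_len"], f"{cost['total'] / 1e9:.2f} GMACs", f"{cost['params'] / 1e6:.1f}M params")
```

Because the per-head cost is multiplied by h while dₖ shrinks as D / h, the head count cancels out of the total, which is why changing Heads leaves the result unchanged.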