PRANA — Vision-Action Policy for Robot Manipulation

Evolution

From L1 regression to flow matching

Each version fixed the failures of the last. v1 taught me what doesn't work. v2 introduced the right inductive biases. v3 upgraded the eyes.

Version 1

Self-attention + L1 loss

All tokens shared one self-attention pool. Deterministic regression learned the average of the demos, a trajectory that matched none of them. A dead language encoder wasted 262MB, and the unfrozen ViT overfit on 83 episodes.

Version 2

Flow matching + cross-attention

Encoder-decoder split. Action queries cross-attend to visual context instead of competing with it. Flow matching with correlated noise handles multimodal actions. Frozen ViT-Tiny. Rolling inpainting for smooth chunk transitions.
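The encoder-decoder split can be pictured with a minimal single-head cross-attention sketch. It is an illustration under assumptions, not PRANA's decoder: the learned query/key/value projections are omitted, and the 50 queries, the 192-d width and the single state token are assumed values.

```python
import numpy as np


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries read from context."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)        # (n_queries, n_context)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over context tokens
    return weights @ context, weights


rng = np.random.default_rng(0)
# Assumed shapes: one query per step of a 50-step chunk, 192-d features.
action_queries = rng.standard_normal((50, 192))
# 394 visual tokens plus one robot-state token (the state token count is assumed).
context = rng.standard_normal((395, 192))

attended, weights = cross_attention(action_queries, context)
print(attended.shape, weights.shape)
```

Each action query only reads from the context tokens here. The queries never appear among the keys, so they cannot attend to each other at this stage.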

Version 3

DINOv2 self-supervised vision

Same flow matching decoder as v2, but DINOv2 ViT-S/14 replaces ViT-Tiny. Self-supervised spatial features encode where objects are, not just what they are. 384d embeddings, 256 tokens per camera. Trajectories are smoother and generalize better to new object placements.

Version 1 — Baseline

PRANA v1

The one that almost worked

83 teleoperation episodes. ViT-Tiny vision encoder, PaliGemma tokenizer (receiving zeros), 4-layer self-attention transformer, 50-step action chunking with L1 loss. It reached for the screwdriver but couldn't grasp consistently.

The failures were instructive. Self-attention let action queries contaminate each other before seeing the scene. Deterministic regression averaged out the multimodal demonstrations. The unfrozen backbone overfit within 2,000 steps.
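The regression failure can be reproduced in a few lines. The toy below is a hypothetical one-dimensional stand-in for the demos, not the PRANA dataset: half the demonstrations go left (-1), half go right (+1). Strictly speaking, an L1 loss pulls toward the median rather than the mean, but with two balanced modes the effect is the same: the loss is flat between the modes, so nothing pushes a deterministic model to commit to either one.

```python
import random

random.seed(0)

# Hypothetical 1-D action: half the demos pass the object on the left (-1),
# half on the right (+1), with a little teleoperation jitter.
demos = [random.gauss(-1.0, 0.1) for _ in range(500)] + \
        [random.gauss(+1.0, 0.1) for _ in range(500)]


def l1_loss(prediction):
    """Mean absolute error of one constant prediction against all demos."""
    return sum(abs(prediction - a) for a in demos) / len(demos)


# Brute-force the best single prediction on a grid from -1.5 to 1.5.
candidates = [i / 100 for i in range(-150, 151)]
best = min(candidates, key=l1_loss)

print("best constant prediction:", best)
print("loss in between (0.0):   ", l1_loss(0.0))
print("loss at left mode (-1.0):", l1_loss(-1.0))
print("loss at right mode (+1.0):", l1_loss(1.0))
```

Committing to either mode scores slightly worse than sitting between them, even though no demonstration ever passes through the middle.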

v1 — reaches but struggles to grasp

Version 2 — Flow Matching

PRANA v2

Iterative denoising with correlated noise

Instead of predicting one average trajectory, the model starts from noise shaped like real robot motions and iteratively refines it into precise actions. Cross-attention lets the decoder actually look at the scene before deciding what to do.
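"Noise shaped like real robot motions" can be sketched as sampling from a Gaussian whose covariance comes from the demonstrations. The snippet is a sketch under assumptions: the demos are synthetic random walks, and β is read as a blend weight between the data covariance and the identity, which is only one plausible interpretation of the β=0.5 setting.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK = 50   # steps per action chunk (assumed), one action dimension for brevity
BETA = 0.5   # blend weight; the meaning of beta here is an assumption

# Hypothetical stand-in for demonstration chunks: smooth random walks.
demos = np.cumsum(rng.standard_normal((500, CHUNK)) * 0.3, axis=1)

# Empirical covariance across timesteps, blended with the identity so the
# matrix stays positive definite, then factored as blended = L @ L.T.
data_cov = np.cov(demos, rowvar=False)                      # (CHUNK, CHUNK)
blended = BETA * data_cov + (1.0 - BETA) * np.eye(CHUNK)
L = np.linalg.cholesky(blended)


def sample_correlated_noise(n):
    """Draw n noise chunks with covariance `blended` via x = L z, z ~ N(0, I)."""
    z = rng.standard_normal((n, CHUNK))
    return z @ L.T


noise = sample_correlated_noise(20000)
adjacent_corr = np.corrcoef(noise[:, 48], noise[:, 49])[0, 1]
print("correlation between neighbouring steps:", adjacent_corr)
```

White noise would give a correlation near zero between neighbouring steps. Here the samples inherit the smoothness of the demos, so denoising starts from something that already resembles a trajectory.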

v2 architecture — context encoder + cross-attention action decoder with flow matching

Frozen ViT-Tiny with camera-ID embeddings
4-layer context encoder (self-attention over vision + state)
7-layer action decoder (cross-attention to context)
Flow matching: 10 Euler denoising steps at inference
Correlated noise from Cholesky decomposition of action covariance
Rolling inpainting: execute 40 steps of each chunk, carry the last 10 into the next chunk for smooth transitions
~11M trainable params, 100K training steps
22ms per chunk inference on RTX 5060
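The sampling and chunk-rolling items in the list above can be sketched together. This is a toy under stated assumptions, not the released inference code: `toy_velocity` stands in for the learned decoder, time runs from noise at t=0 to actions at t=1, and the carried-over steps are simply clamped during denoising, since the page does not say how the inpainting is conditioned.

```python
import numpy as np

CHUNK, EXECUTE, CARRY = 50, 40, 10   # execute 40 steps, carry 10 into the next chunk
NUM_STEPS = 10                       # Euler denoising steps at inference


def toy_velocity(x, t, target):
    """Stand-in for the learned decoder: exact velocity toward a known target."""
    return (target - x) / (1.0 - t)


def sample_chunk(noise, target, prefix=None, num_steps=NUM_STEPS):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        if prefix is not None:
            # Inpainting (assumed form): pin the carried-over steps to their
            # point on the straight noise-to-data path at time t.
            n = len(prefix)
            x[:n] = (1.0 - t) * noise[:n] + t * prefix
        x = x + dt * toy_velocity(x, t, target)
    if prefix is not None:
        x[: len(prefix)] = prefix    # the carried steps are kept exactly
    return x


rng = np.random.default_rng(0)
executed, prefix = [], None
for chunk_idx in range(3):
    target = np.full(CHUNK, float(chunk_idx))    # hypothetical per-chunk plan
    chunk = sample_chunk(rng.standard_normal(CHUNK), target, prefix)
    executed.append(chunk[:EXECUTE])             # send 40 steps to the robot
    prefix = chunk[EXECUTE:]                     # save 10 for the next chunk
```

With the toy velocity, the ten Euler steps land on the target up to floating-point error, which a learned model would only approximate. The point of the loop is the bookkeeping: each new chunk begins with the ten steps the previous chunk already committed to, so consecutive chunks join without a jump.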

v2 — picks up screwdriver and places in box

Version 3 — DINOv2 Vision

PRANA v3

Same decoder.
Better eyes.

DINOv2 ViT-S/14 replaces ViT-Tiny. Self-supervised pretraining on 142M images means the backbone encodes where objects are spatially, not just what class they are. Each camera yields 256 tokens of 384 dimensions, about 2.6x the feature values of ViT-Tiny's 197 tokens of 192 dimensions.

v3 architecture — DINOv2 backbone with 512 visual tokens feeding the same flow matching decoder

DINOv2 ViT-S/14 frozen backbone (22M params)
384d embeddings, 256 tokens per camera (vs 192d, 197 tokens in v2)
512 total visual tokens (vs 394 in v2)
Same flow matching decoder, same correlated noise
Smoother trajectories, better generalization to object placement
~30M trainable params
Less jittery than v2, reduced position error at grasp point
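The token counts in the list above follow from simple patch arithmetic. The sketch assumes 224x224 inputs, a patch size of 16 for ViT-Tiny, a kept class token for ViT-Tiny only, and two cameras. None of these are stated on the page, but together they reproduce its numbers.

```python
IMAGE_SIZE = 224   # assumed input resolution
NUM_CAMERAS = 2    # inferred from 512 / 256 and 394 / 197


def token_count(image_size, patch_size, cls_token):
    """Patch tokens for a square image, optionally plus one class token."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)


vit_tiny_tokens = token_count(IMAGE_SIZE, 16, cls_token=True)   # 14 * 14 + 1
dinov2_tokens = token_count(IMAGE_SIZE, 14, cls_token=False)    # 16 * 16

v2_total = NUM_CAMERAS * vit_tiny_tokens
v3_total = NUM_CAMERAS * dinov2_tokens

# Feature values per camera: tokens times embedding width.
ratio = (dinov2_tokens * 384) / (vit_tiny_tokens * 192)

print(vit_tiny_tokens, dinov2_tokens, v2_total, v3_total, round(ratio, 1))
```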

v3 — smoother grasp, better generalization with DINOv2

Comparison

Side by side

                v1                    v2                         v3
Vision          ViT-Tiny (unfrozen)   ViT-Tiny (frozen)          DINOv2 ViT-S/14 (frozen)
Attention       Self-attention only   Cross-attention decoder    Cross-attention decoder
Action head     L1 regression         Flow matching (10 steps)   Flow matching (10 steps)
Noise           N/A                   Correlated (β=0.5)         Correlated (β=0.5)
Visual tokens   394                   394                        512
Inference       ~5ms                  ~22ms                      ~30ms
Grasps          With assistance       Yes                        Yes, smoother
Places in box   No                    With assistance            With assistance

v2 — ViT-Tiny + flow matching

v3 — DINOv2 + flow matching

Links

Open source.
Everything.

Architecture, training scripts, deployment code, model weights, dataset, and the mistakes along the way.

GitHub → v2 Weights → Dataset →

PRANA: Three iterations. One robot that grasps.
