Vision-Action Policy   2025–2026

PRANA
Three iterations.
One robot that grasps.

From a deterministic baseline that barely reached the screwdriver, to a flow-matching policy with correlated noise that picks it up and places it in the box. Built on a $300 arm with one laptop GPU.

3 architectures
123 episodes
10 Hz control rate
8 GB, single GPU

Evolution

From L1 regression
to flow matching

Each version fixed the failures of the last. v1 taught me what doesn't work. v2 introduced the right inductive biases. v3 upgraded the eyes.

Version 1
Self-attention + L1 loss
All tokens shared one self-attention pool. Deterministic regression learned the average of the demos, a trajectory that matched none of them. A dead language encoder wasted 262 MB. The unfrozen ViT overfit on 83 episodes.
Version 2
Flow matching + cross-attention
Encoder-decoder split. Action queries cross-attend to visual context instead of competing with it. Flow matching with correlated noise handles multimodal actions. Frozen ViT-Tiny. Rolling inpainting for smooth chunk transitions.
Version 3
DINOv2 self-supervised vision
Same flow matching decoder as v2, but DINOv2 ViT-S/14 replaces ViT-Tiny. Self-supervised spatial features encode where objects are, not just what they are. 384d embeddings, 256 tokens per camera. Trajectories are smoother and generalize better to new object placements.
Version 1 — Baseline
PRANA v1

The one that
almost worked

83 teleoperation episodes. A ViT-Tiny vision encoder, a PaliGemma tokenizer whose input was always zeros, a 4-layer self-attention transformer, and 50-step action chunking with L1 loss. It reached for the screwdriver but couldn't grasp consistently.

The failures were instructive: self-attention let action queries contaminate each other before seeing the scene. Deterministic regression averaged out the multimodal demonstrations. The unfrozen backbone overfitted in 2000 steps.
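The averaging failure can be reproduced in a few lines. This is a toy sketch, not the v1 training code: the one-dimensional "demos" and their two modes at -1 and +1 are made up for illustration. It shows that the best single prediction under either L2 or L1 lands between the modes, far from every demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy: half the demos go left (-1), half go right (+1).
# The real demos are 50-step, multi-joint trajectories.
demos = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# A deterministic regressor emits one value per input. With nothing in the
# input to separate the modes, the best constant under L2 is the mean and
# under L1 is a median.
l2_optimum = demos.mean()
l1_optimum = np.median(demos)

# Distance from each optimum to the nearest actual demonstration.
gap_l2 = np.abs(demos - l2_optimum).min()
gap_l1 = np.abs(demos - l1_optimum).min()

print(f"L2 optimum {l2_optimum:+.3f}, nearest demo is {gap_l2:.3f} away")
print(f"L1 optimum {l1_optimum:+.3f}, nearest demo is {gap_l1:.3f} away")
```

Both optima sit near zero, in the gap where no demonstration ever went. A generative action head avoids this by sampling one mode instead of predicting a compromise.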

v1 — reaches but struggles to grasp
Version 2 — Flow Matching
PRANA v2

Iterative denoising with
correlated noise

Instead of predicting one average trajectory, the model starts from noise shaped like real robot motions and iteratively refines it into precise actions. Cross-attention makes the decoder look at the scene before it decides what to do.
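The refinement loop can be sketched as plain Euler integration of a velocity field. Everything specific here is an assumption made for illustration: the time convention (t=0 is noise, t=1 is actions), the 6-dimensional action, and `toy_velocity`, which stands in for the trained decoder and simply points at a fixed target trajectory.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (actions)."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Hypothetical target chunk: 50 steps x 6 action dimensions (6 is assumed).
target = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 6))

def toy_velocity(x, t):
    # Stand-in for the decoder: the straight-line velocity toward `target`.
    # The real model predicts this from images and robot state instead.
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)
actions = euler_sample(toy_velocity, noise, num_steps=10)
print(np.abs(actions - target).max())
```

With an exact straight-line velocity, ten Euler steps land on the target up to floating-point error. A learned velocity field is only approximately straight, which is why the step count is a tuning choice.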

v2 architecture — context encoder + cross-attention action decoder with flow matching
  • Frozen ViT-Tiny with camera-ID embeddings
  • 4-layer context encoder (self-attention over vision + state)
  • 7-layer action decoder (cross-attention to context)
  • Flow matching: 10 Euler denoising steps at inference
  • Correlated noise from Cholesky decomposition of action covariance
  • Rolling inpainting: execute 40 steps of each chunk, carry the last 10 into the next chunk for smooth transitions
  • ~11M trainable params, 100K training steps
  • 22 ms per chunk inference on RTX 5060
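The correlated-noise item in the list above can be sketched with NumPy. The page does not say how β enters, so this sketch assumes a blend between the empirical action covariance and the identity, `beta * cov + (1 - beta) * I`. The random-walk `chunks` are made-up stand-ins for flattened demonstration chunks.

```python
import numpy as np

def make_noise_sampler(action_chunks, beta=0.5, seed=0):
    """Build a sampler for noise whose covariance mimics the demonstrations.

    action_chunks: (N, D) array of flattened demonstration chunks.
    """
    d = action_chunks.shape[1]
    cov = np.cov(action_chunks, rowvar=False)           # (D, D) empirical covariance
    # Assumed role of beta: shrink toward white noise so the matrix stays
    # positive definite and the noise keeps some independent variation.
    blended = beta * cov + (1.0 - beta) * np.eye(d)
    chol = np.linalg.cholesky(blended)                  # lower L with L @ L.T == blended
    rng = np.random.default_rng(seed)

    def sample(n):
        z = rng.standard_normal((n, d))                 # white noise
        return z @ chol.T                               # rows have covariance `blended`

    return sample, blended

# Hypothetical demos: smooth random walks over 8 timesteps of one joint.
rng = np.random.default_rng(1)
chunks = np.cumsum(rng.standard_normal((2000, 8)), axis=1)

sample_noise, cov_used = make_noise_sampler(chunks, beta=0.5)
noise = sample_noise(20000)
adjacent_corr = np.corrcoef(noise[:, 3], noise[:, 4])[0, 1]
print(f"correlation between neighbouring timesteps: {adjacent_corr:.2f}")
```

Neighbouring timesteps in the sampled noise move together, as they do in real motions, so the denoiser starts from something trajectory-shaped rather than from independent jitter at every step.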
v2 — picks up screwdriver and places in box
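The rolling inpainting scheme from the v2 list (execute 40, save 10) can be sketched as below. This is one plausible reading, not the project's code: it assumes the 10 saved steps become the pinned first 10 steps of the next chunk, and that pinning is done by placing those steps on the straight noise-to-data path at every Euler step. `toy_velocity` and the 6-dimensional action are hypothetical stand-ins.

```python
import numpy as np

CHUNK, EXECUTE, KEEP = 50, 40, 10   # execute 40, save 10

def sample_chunk(velocity_fn, noise, prefix=None, num_steps=10):
    """Euler sampling with the first len(prefix) steps pinned to known actions."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        if prefix is not None:
            # Put the known steps on the straight noise-to-data path at time t.
            x[:len(prefix)] = (1.0 - t) * noise[:len(prefix)] + t * prefix
        x = x + dt * velocity_fn(x, t)
    if prefix is not None:
        x[:len(prefix)] = prefix    # final result carries the saved steps exactly
    return x

# Hypothetical stand-in for the decoder: pulls toward a fixed ramp. The real
# decoder also sees the pinned prefix and continues from it.
target = np.tile(np.linspace(0.0, 1.0, CHUNK)[:, None], (1, 6))

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
chunks, executed, prefix = [], [], None
for _ in range(3):
    noise = rng.standard_normal((CHUNK, 6))
    chunk = sample_chunk(toy_velocity, noise, prefix)
    chunks.append(chunk)
    executed.append(chunk[:EXECUTE])   # sent to the robot
    prefix = chunk[-KEEP:]             # saved to seed the next chunk
executed = np.concatenate(executed)
print(executed.shape)
```

Under this reading each new chunk begins exactly where the previous plan said it would, so the hand-off between chunks has no jump.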
Version 3 — DINOv2 Vision
PRANA v3

Same decoder.
Better eyes.

DINOv2 ViT-S/14 replaces ViT-Tiny. Self-supervised pretraining on 142M images means the backbone encodes where objects are spatially, not just what class they are. It outputs 384-dimensional embeddings at 256 tokens per camera, about 2.6x as many feature values per camera as ViT-Tiny (384 × 256 vs 192 × 197).

v3 architecture — DINOv2 backbone with 512 visual tokens feeding the same flow matching decoder
  • DINOv2 ViT-S/14 frozen backbone (22M params)
  • 384d embeddings, 256 tokens per camera (vs 192d, 197 tokens in v2)
  • 512 total visual tokens (vs 394 in v2)
  • Same flow matching decoder, same correlated noise
  • Smoother trajectories, better generalization to object placement
  • ~30M trainable params
  • Less jittery than v2, reduced position error at grasp point
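The token counts in the list above follow from patch arithmetic. The sketch assumes 224×224 input images, a patch size of 16 for ViT-Tiny, a kept class token for ViT-Tiny, and patch tokens only for DINOv2. None of these is stated on the page, but together they reproduce its numbers, as does a two-camera setup.

```python
def patch_tokens(image_size, patch_size, cls_token):
    """Number of tokens a ViT emits for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)

IMAGE_SIZE = 224    # assumption: input resolution is not stated
NUM_CAMERAS = 2     # implied by 394 = 2 x 197 and 512 = 2 x 256

vit_tiny = patch_tokens(IMAGE_SIZE, 16, cls_token=True)    # 14 * 14 + 1
dinov2_s = patch_tokens(IMAGE_SIZE, 14, cls_token=False)   # 16 * 16

v2_total = NUM_CAMERAS * vit_tiny
v3_total = NUM_CAMERAS * dinov2_s

# Feature values per camera: embedding width times token count.
ratio = (384 * dinov2_s) / (192 * vit_tiny)

print(vit_tiny, dinov2_s, v2_total, v3_total, round(ratio, 2))
```

The 2.6x figure is therefore a count of feature values per camera. The token count alone grows by only about 1.3x (256 vs 197).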
v3 — smoother grasp, better generalization with DINOv2
Comparison

Side by side

Component       v1                    v2                         v3
Vision          ViT-Tiny (unfrozen)   ViT-Tiny (frozen)          DINOv2 ViT-S/14 (frozen)
Attention       Self-attention only   Cross-attention decoder    Cross-attention decoder
Action head     L1 regression         Flow matching (10 steps)   Flow matching (10 steps)
Noise           N/A                   Correlated (β=0.5)         Correlated (β=0.5)
Visual tokens   394                   394                        512
Inference       ~5 ms                 ~22 ms                     ~30 ms
Grasps          With assistance       Yes                        Yes, smoother
Places in box   No                    With assistance            With assistance
v2 — ViT-Tiny + flow matching
v3 — DINOv2 + flow matching
Links

Open source.
Everything.

Architecture, training scripts, deployment code, model weights, dataset, and the mistakes along the way.