Vision Language Action   2025

PRANA
Built from scratch.
Deployed on hardware.

I built an end-to-end VLA model from the ground up and deployed it on a real 7-DoF arm. No pretrained backbone shortcuts. It runs at 50 Hz and hits over 90% success on a physical pick-and-place task.

>90%
Task success
50 Hz
Control rate
500+
Training episodes
7-DoF
Physical platform
Deployment footage
Autonomous screwdriver retrieval on a physical 7-DoF platform
What this is

A VLA model I
actually built

PRANA stands for Perception-conditioned Robotic Network with Attention. I designed and trained every component from scratch — the vision encoders, the language embedding, the transformer architecture, and the action chunking mechanism.

The model fuses two camera streams, a language instruction, and the robot's joint state into a single sequence. A transformer processes that sequence and outputs 50 future actions in one forward pass. That chunk gets executed at 50 Hz on the physical arm.

I used LeRobot as the deployment harness and collected over 500 teleoperation episodes myself. The task is screwdriver retrieval in an unstructured environment — small object, tight tolerances, real consequences for bad predictions.

Architecture

Three inputs.
One sequence.
50 actions out.

The core idea is straightforward. I encode vision, language, and robot state into tokens of the same dimension, concatenate them into one sequence, and let a transformer figure out what the arm should do next. 50 learnable action queries read from that context and output a full action horizon in a single shot.

This means no autoregressive decoding, no waiting. The whole chunk comes out at once and the arm executes it. That's how I get 50 Hz on real hardware without a GPU strapped to the robot.

📷
Camera 1
Table view · 224×224
📷
Camera 2
Wrist view · 224×224
🔤
Instruction
Language tokens · 16
ViT-Tiny
197 patch tokens
ViT-Tiny
197 patch tokens
Embedding
256k vocab · 16 tokens
↓   concatenate with joint state (1 token)
394 visual + 16 language + 1 state + 50 action queries
Transformer Encoder
4 layers · 8 attention heads · Pre-LN · dim 256
↓ last 50 tokens
Action Head   Linear → GELU → Linear
Action chunk [50 steps × 6 DoF]   executed at 50 Hz
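The diagram above translates almost directly into PyTorch. This is a minimal sketch of the token layout and forward pass, not the actual PRANA code — a plain tensor stands in for the two ViT-Tiny outputs, and the module names are mine:

```python
import torch
import torch.nn as nn

class PranaSketch(nn.Module):
    """Illustrative token layout: 394 visual + 16 language + 1 state
    + 50 action-query tokens -> transformer -> 50-step action chunk."""
    def __init__(self, dim=256, horizon=50, dof=6, vocab=256_000, max_lang=16):
        super().__init__()
        self.lang_emb = nn.Embedding(vocab, dim)          # 256k vocab
        self.state_proj = nn.Linear(dof, dim)             # joint state -> 1 token
        self.action_queries = nn.Parameter(torch.zeros(horizon, dim))
        layer = nn.TransformerEncoderLayer(
            dim, nhead=8, norm_first=True, batch_first=True)  # Pre-LN, 8 heads
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dof))
        self.horizon = horizon

    def forward(self, vis_tokens, lang_ids, state):
        # vis_tokens: (B, 394, dim) -- 197 patch tokens from each of two ViT-Tinys
        B = vis_tokens.shape[0]
        lang = self.lang_emb(lang_ids)                    # (B, 16, dim)
        st = self.state_proj(state).unsqueeze(1)          # (B, 1, dim)
        q = self.action_queries.expand(B, -1, -1)         # (B, 50, dim)
        seq = torch.cat([vis_tokens, lang, st, q], dim=1) # (B, 461, dim)
        out = self.encoder(seq)
        return self.head(out[:, -self.horizon:])          # (B, 50, dof)
```

A single forward pass yields the full 50-step chunk; there is no decoding loop.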
Training

500 episodes.
All mine.

I collected every single training episode myself through teleoperation using the LeRobot recording stack. Two synchronized camera streams, proprioceptive joint state, and a fixed language instruction per episode.

The key training design decision was action chunking: instead of predicting one action per step, the model predicts 50 steps at once. This curbs compounding error and lets the arm commit to smooth trajectories rather than reacting jerkily at every timestep.

  • 500+ teleoperation episodes collected personally
  • Dual camera streams at 224×224, streamed from parquet files
  • Episode boundary integrity enforced — no chunk spans two episodes
  • L1 loss with padding mask on terminal frames
  • AdamW optimizer, lr 1e-4, weight decay 1e-4
  • MEAN_STD normalisation across visual, state, and action modalities
  • PaliGemma tokenizer — max length 16 tokens per instruction
  • Statistics serialised with the checkpoint for deployment
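Two of the bullets above — episode boundary integrity and the padding-masked L1 loss — can be sketched together. This is an illustrative NumPy version with names of my own choosing, not the PRANA training code:

```python
import numpy as np

def make_chunk(actions, episode_end, t, horizon=50):
    """Slice a `horizon`-step action chunk starting at t, clipped at the
    episode boundary so no chunk spans two episodes; pad the tail by
    repeating the last action and return a mask over valid steps."""
    end = min(t + horizon, episode_end)        # never cross the episode end
    chunk = actions[t:end]
    pad = horizon - len(chunk)
    mask = np.concatenate([np.ones(len(chunk)), np.zeros(pad)])
    if pad:
        chunk = np.concatenate([chunk, np.repeat(chunk[-1:], pad, axis=0)])
    return chunk, mask

def masked_l1(pred, target, mask):
    """L1 loss averaged over non-padded timesteps only."""
    per_step = np.abs(pred - target).mean(axis=-1)   # (horizon,)
    return (per_step * mask).sum() / mask.sum()
```

Padded terminal frames contribute zero loss, so the model is never penalised for predictions past the end of an episode.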
Deployment

How it runs on
the actual arm

Getting a trained policy onto real hardware is a different problem from training it. Here is exactly how PRANA goes from a checkpoint to motor commands on the physical arm.

01
Factory hook into LeRobot
PranaPolicy subclasses PreTrainedPolicy and registers under "prana_v1". A lightweight factory hook routes LeRobot's recording and evaluation loop through my policy without touching the framework internals.
LeRobot   Zero fork
02
Action chunk inference
On the first call to select_action(), the model runs one forward pass and produces 50 actions. These go into a deque. Every subsequent control tick just pops the next action — inference cost is amortised over 50 steps.
50 Hz   Single inference
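The amortised-inference pattern in step 02 fits in a few lines. A sketch, assuming a model that returns a `(1, 50, dof)` chunk — the class and method names here mirror the description, not the exact source:

```python
from collections import deque

import torch

class ChunkedPolicy:
    """One forward pass fills a queue of 50 actions; each control tick
    pops one, so inference cost is spread across the whole chunk."""
    def __init__(self, model, horizon=50):
        self.model = model
        self.horizon = horizon
        self._queue = deque()

    @torch.no_grad()
    def select_action(self, obs):
        if not self._queue:                   # queue empty -> infer once
            chunk = self.model(obs)           # (1, horizon, dof)
            self._queue.extend(chunk[0])      # enqueue all 50 actions
        return self._queue.popleft()          # O(1) per 20 ms control tick
```

At 50 Hz, each tick has a 20 ms budget; popping from a deque costs microseconds, so only one tick in fifty pays for a forward pass.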
03
Un-normalisation to motor radians
The network outputs normalised actions. The training dataset statistics (mean and std) are saved alongside the checkpoint and loaded at inference time. Every action gets multiplied by std and added to mean before going to the arm.
Checkpoint-bound stats
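Un-normalisation itself is one line of arithmetic. A sketch, assuming the checkpoint-bound statistics are stored as per-dimension mean/std arrays (the dict layout here is my assumption, not the actual serialisation format):

```python
import numpy as np

def unnormalize(action, stats):
    """Map a normalised network output back to motor radians using the
    dataset statistics saved alongside the checkpoint."""
    mean = np.asarray(stats["action"]["mean"])
    std = np.asarray(stats["action"]["std"])
    return action * std + mean   # inverse of (x - mean) / std
```

Binding the statistics to the checkpoint means a policy can never be deployed against the wrong normalisation constants.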
04
Motor commands to the 7-DoF arm
Physical motor radians sent directly to the joint controllers. The arm executes the chunk, the queue refills at the next inference cycle, and the loop continues until task completion.
Real hardware   7-DoF
System

Every component
accounted for

Vision Encoder
vit_tiny_patch16_224
Extracts 197 patch tokens from each camera stream. Two cameras run in parallel — 394 visual tokens total.
Language Encoder
nn.Embedding (256k vocab)
Maps the tokenised instruction into the same hidden space as the visual tokens. Trained end-to-end.
State Encoder
nn.Linear
Compresses 6-DoF joint angles into a single proprioceptive token and injects it into the sequence.
Action Queries
nn.Parameter
50 learnable vectors that attend over the full context and get updated by the transformer into action predictions.
Transformer
TransformerEncoder · 4L · 8H
Pre-LN transformer encoder. Processes the full sequence and updates the action queries with contextual information.
Action Head
Linear → GELU → Linear
Projects the 50 updated query tokens into 50 steps of 6-DoF joint commands. One forward pass, full horizon out.
Tokenizer
PaliGemma 3B tokenizer
Converts the instruction string into up to 16 integer token IDs. Padded to fixed length at training and inference.
Framework
LeRobot (Hugging Face)
Handles data collection, training loop, checkpoint saving, and the hardware deployment harness.