Capability notes
Vision-Language-Action (VLA) models — models that take visual input and language instructions and output robot actions — are the frontier for AI-driven robot control. In 2026, three families dominate: **OpenVLA** (open-weight, fine-tunable), **RT-2-X** (Google research, limited open release), and **Pi0** (Physical Intelligence, API-gated). Octo (UC Berkeley) is the smaller open-weight alternative.
**OpenVLA** is a 7B-parameter VLA on a Prismatic VLM backbone (SigLIP vision + Llama language) fine-tuned on the Open X-Embodiment dataset — 1M+ real trajectories across 60+ robot embodiments. It outputs continuous action chunks (7-DoF arm positions, gripper states) at 5-10 Hz. On [RTX 4090](/hardware/rtx-4090) via [transformers](/tools/transformers): 15-25 tok/s, which works out to a 3-5 Hz control loop once image encoding is included. Fine-tuning on a specific robot requires 50-200 demonstrations for 70-85% task completion on pick-and-place. Zero-shot generalization to novel objects is 60-70%, dropping to 30-40% in novel environments.
**RT-2-X** (55B full, 15B open variant) outperforms OpenVLA on generalization (75-85% on novel objects) but requires 36-40GB VRAM and is built in JAX — porting to PyTorch for non-Google hardware is non-trivial. **Pi0** claims 90%+ success on manipulation benchmarks but is API-gated — no local deployment.
**Fine-tuning is mandatory.** No VLA generalizes out-of-the-box to a novel robot. The sim-to-real gap is 30-50% — models trained in simulation fail on physical hardware due to visual texture differences, lighting changes, camera calibration drift, and physics mismatches. Operators budget 2-6 weeks for fine-tuning data collection (50-200 demonstrations per task per robot).
If you just want to try this
Lowest-friction path to a working setup.
There is no beginner path for safe, working on-device robot AI. Start in simulation (MuJoCo + OpenVLA) before touching hardware. Budget 2-4 weeks for a working simulation pipeline.
Step 1: Install MuJoCo (Google DeepMind's free physics simulator) and robosuite (robot simulation framework). These run on any laptop with a GPU. An [RTX 3060 12GB](/hardware/rtx-3060-12gb) is adequate for single-robot simulation.
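A minimal setup sketch — the Lift task, camera name, and control frequency are illustrative choices, not requirements:

```python
# pip install mujoco robosuite
import robosuite as suite

# Single-arm Panda with camera observations for the VLA's vision encoder.
env = suite.make(
    "Lift",                       # simple pick-up task bundled with robosuite
    robots="Panda",
    has_renderer=False,           # headless; set True to watch the rollout
    has_offscreen_renderer=True,  # required to render camera observations
    use_camera_obs=True,
    camera_names="agentview",
    camera_heights=224,           # match OpenVLA's expected input resolution
    camera_widths=224,
    control_freq=20,              # controller steps per second
)

obs = env.reset()
print(obs["agentview_image"].shape)  # (224, 224, 3) RGB frame for the VLA
```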
Step 2: Load OpenVLA via Hugging Face [transformers](/tools/transformers): `openvla/openvla-7b`. Requires ~14GB VRAM — [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) minimum. FP16 by default; 8-bit quantization via bitsandbytes reduces this to ~8GB with marginal quality loss.
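A loading sketch following the model card's documented usage (`trust_remote_code` pulls in OpenVLA's own modeling code):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half precision: ~14GB VRAM
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # OpenVLA ships custom modeling code
).to("cuda:0")
# For ~8GB VRAM instead: pass quantization_config=BitsAndBytesConfig(load_in_8bit=True)
# (imported from transformers) and drop the .to("cuda:0") — bitsandbytes handles placement.
```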
Step 3: Set up a simulated Franka Panda arm in robosuite with a pick-and-place task. Use OpenVLA in a perception-action loop: capture simulated frame → vision encoder → generate action chunk → execute in MuJoCo → repeat. This runs at 3-5 Hz in simulation.
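Putting it together, a minimal sketch of the loop, reusing `env`, `vla`, and `processor` from the previous steps. Feeding OpenVLA's 7-D output straight into robosuite's controller is an assumption — normalization and frame conventions need per-setup calibration:

```python
import numpy as np
from PIL import Image

prompt = "In: What action should the robot take to pick up the cube?\nOut:"
obs = env.reset()
for step in range(200):
    # robosuite renders OpenGL-style (bottom-up); flip to a normal RGB frame.
    frame = Image.fromarray(obs["agentview_image"][::-1])
    inputs = processor(prompt, frame).to("cuda:0", dtype=torch.bfloat16)
    # predict_action de-tokenizes the output into a continuous 7-D action;
    # unnorm_key selects the dataset statistics used to un-normalize it.
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    # ASSUMPTION: the 7-D action maps directly onto the env's controller input.
    obs, reward, done, info = env.step(np.asarray(action)[:7])
    if done:
        break
```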
Step 4: Evaluate. Expect 40-60% success on simple pick-and-place zero-shot. Run 100 trials for statistical significance. If below 50%, fine-tuning is required.
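To see why 100 trials is the floor, check how wide the confidence interval still is at that sample size — a minimal sketch using the Wilson score interval:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# 52 successes in 100 trials -> roughly (0.42, 0.62): still too wide to
# distinguish 45% from 60%, which is why the trial count matters.
print(wilson_interval(52, 100))
```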
Step 5: Fine-tune with LoRA on 50-100 simulated demonstrations collected via keyboard/mouse teleoperation. LoRA requires ~20GB VRAM (rank=16). Training: 4-8 hours on [RTX 4090](/hardware/rtx-4090). Full fine-tuning needs 40GB+ VRAM.
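A sketch of the LoRA setup with the peft library — the rank matches the text; alpha, dropout, and the target-module choice are illustrative defaults (the OpenVLA repo also ships its own fine-tuning script):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                          # rank from the text; ~20GB VRAM in practice
    lora_alpha=32,                 # illustrative scaling; tune per setup
    lora_dropout=0.05,
    target_modules="all-linear",   # adapt every linear layer; narrower sets also work
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()   # typically well under 1% of the 7B base weights
```

Training then runs as standard supervised next-token prediction over the teleoperated demonstration tuples.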
Only when sim success exceeds 80% should you consider physical deployment — and expect sim-to-real to drop that to 50-60% on first hardware transfer. Physical deployment requires safety infrastructure (e-stop, velocity limits, collision detection) that simulation does not enforce.
For production deployment
Operator-grade recommendation.
Production robot AI is safety-critical engineering. A VLA controlling physical hardware has failure modes that damage equipment and injure people.
**Safety-critical inference.** Minimum safe architecture: **dual inference with disagreement detection**. Run the same VLA on two independent GPUs, compare action vectors, and execute only when the difference is below a calibrated threshold (<5% joint range for position, <10% for velocity). On divergence, fall back to a conservative hard-coded policy (stop or hold). Two physically separate GPUs are required — two virtual instances share failure modes. For one robot arm: dual [RTX 4070 Ti](/hardware/rtx-4070-ti) (~$1,600 total) or dual [RTX 4090](/hardware/rtx-4090) (~$3,600).
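A sketch of the disagreement gate, reusing the transformers-style `predict_action` call from earlier; `hold_policy` and the per-joint `joint_range` vector are stand-ins, and the threshold is the <5% figure above:

```python
import numpy as np
import torch

POS_THRESHOLD = 0.05  # <5% of joint range for position actions, per the text

def redundant_action(frame, prompt, processor, vla_a, vla_b, joint_range, hold_policy):
    """Run the same inputs through two model instances on physically separate GPUs."""
    in_a = processor(prompt, frame).to("cuda:0", dtype=torch.bfloat16)
    in_b = processor(prompt, frame).to("cuda:1", dtype=torch.bfloat16)
    act_a = np.asarray(vla_a.predict_action(**in_a, unnorm_key="bridge_orig", do_sample=False))
    act_b = np.asarray(vla_b.predict_action(**in_b, unnorm_key="bridge_orig", do_sample=False))
    # Worst-case per-joint disagreement, as a fraction of that joint's range.
    if (np.abs(act_a[:7] - act_b[:7]) / joint_range).max() < POS_THRESHOLD:
        return act_a               # models agree: execute
    return hold_policy()           # divergence: stop or hold, never average or guess
```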
**Latency requirements.** Control loop must meet the robot's minimum frequency. For Franka Panda (1 kHz internal, 100 Hz external), VLA at 5 Hz generates waypoints; a motion planner interpolates at 100 Hz. Gripper control at 5 Hz is adequate. Latency targets: image encoding <50ms, LLM inference <150ms, action decoding <20ms — total <220ms. If any component exceeds budget, hold the current waypoint (safe). Never extrapolate stale waypoints (dangerous).
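A sketch of the budget enforcement; the stage callables are hypothetical stand-ins for the encode/infer/decode implementations:

```python
import time

# Per-stage budgets from the text, in seconds.
BUDGETS = {"encode": 0.050, "infer": 0.150, "decode": 0.020}

def timed_cycle(stages, hold_waypoint, send_waypoint, frame):
    """Run encode -> infer -> decode under the latency budget.

    `stages` maps each name in BUDGETS to a callable. On any overrun we hold
    the current waypoint (safe) rather than forwarding a stale one (dangerous).
    """
    value = frame
    for name in ("encode", "infer", "decode"):
        t0 = time.monotonic()
        value = stages[name](value)
        if time.monotonic() - t0 > BUDGETS[name]:
            hold_waypoint()
            return None
    send_waypoint(value)
    return value
```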
**Edge vs cloud.**
- Edge (local GPU): Fixed cost, deterministic latency (std dev <10ms), works during network outages. Cost: $2,000-10,000 one-time per robot. Default for safety-critical.
- Cloud API: Variable cost, variable latency (30-300ms round-trip), internet-dependent. For non-safety-critical tasks (inventory scanning, monitoring) where stalls are acceptable.
**Fine-tuning for deployment.** Collect 100-200 demonstrations on the specific physical hardware (not sim). Each demonstration is teleoperated, producing (image, joint_positions, gripper_state, task_text) tuples. LoRA fine-tune for 5-10 epochs. Production threshold: >90% task completion on validation, <5% action variance in repeated scenarios. Below threshold: collect more demonstrations and retrain.
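A sketch of how the tuples and the acceptance gate might be structured. Reading "<5% action variance" as per-joint standard deviation across repeats of a fixed scenario, expressed as a fraction of joint range, is an interpretation, not a spec:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoStep:
    image: np.ndarray            # HxWx3 RGB frame at the moment of the action
    joint_positions: np.ndarray  # 7-DoF arm configuration
    gripper_state: float         # 0.0 open .. 1.0 closed
    task_text: str               # instruction, e.g. "place the part in the fixture"

def passes_production_gate(successes, trials, repeat_actions, joint_range):
    """Gate from the text: >90% task completion, <5% action variance on repeats.

    repeat_actions: list of (N, 7) arrays, each holding the actions the model
    produced across N repetitions of one fixed validation scenario.
    """
    completion = successes / trials
    worst_spread = max(
        float((np.std(batch, axis=0) / joint_range).max()) for batch in repeat_actions
    )
    return completion > 0.90 and worst_spread < 0.05
```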
**Calibration maintenance.** The VLA's vision encoder assumes fixed camera extrinsics. A bumped camera degrades predictions — not gradually and uniformly, but abruptly and in specific workspace regions. Daily calibration: execute 5 pre-recorded trajectories, compare joint encoders to expected values. >0.5° deviation triggers recalibration before VLA operation. This is a maintenance burden non-robotics ML engineers systematically underestimate.
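A sketch of that daily check; `robot.move_to` and `robot.joint_positions` are hypothetical stand-ins for the controller API, and the threshold is the 0.5° figure above:

```python
import numpy as np

DEVIATION_LIMIT_DEG = 0.5   # recalibration trigger from the text

def daily_calibration_check(robot, reference_trajectories):
    """Replay pre-recorded trajectories, comparing encoder readback to expectation.

    reference_trajectories: list of (waypoints, expected_joints) pairs recorded
    at commissioning time, both in radians.
    """
    worst = 0.0
    for waypoints, expected in reference_trajectories:
        for wp, exp in zip(waypoints, expected):
            robot.move_to(wp)
            actual = np.asarray(robot.joint_positions())
            worst = max(worst, float(np.degrees(np.abs(actual - exp)).max()))
    if worst > DEVIATION_LIMIT_DEG:
        raise RuntimeError(
            f"Deviation {worst:.2f} deg exceeds {DEVIATION_LIMIT_DEG} deg: "
            "recalibrate before VLA operation"
        )
    return worst
```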
What breaks
Failure modes operators see in the wild.
- **Sim-to-real distribution shift causing dangerous actions.** Models trained in simulation encounter different textures, lighting, and dynamics on hardware. Symptom: 85% success in sim, 35% on physical robot — collisions, overshoot, excessive force. Mitigation: domain randomization during sim training (randomize lighting, textures, camera position ±5%). Collect 20%+ of demonstrations on physical hardware. Implement force/torque limits at the controller level that override VLA outputs.
- **Latency spike causing control loop failure.** A missed VLA inference delays control output by 100-200ms. If the motion planner interpolates on stale waypoints, the robot moves toward an outdated target. Symptom: overshoots grasp, knocks over object. Mitigation: decelerate to zero if next waypoint not received within 2x expected interval. Discard waypoints older than 30ms.
- **VLA hallucinating impossible joint configurations.** The model learns joint limits from training data — if edge cases are absent, it commands unreachable positions. Symptom: robot triggers emergency stop attempting motion through joint limits. Mitigation: clamp VLA outputs to joint physical range. Forward/inverse kinematics verify reachability. Fall back to safe pose if unreachable.
- **Safety constraint violation in generated trajectories.** VLAs imitate human demonstrations, including operator violations (moving fast near fragile objects, entering another robot's workspace). Symptom: after 200 deployments, the robot executes a high-velocity motion near a human. Mitigation: a safety filter layer between VLA and actuation enforces hard constraints — velocity ceiling, keep-out zones, force limits (see the sketch after this list). The VLA proposes; the safety filter approves or rejects.
- **Calibration drift.** Thermal expansion or accidental bumping shifts the camera by 1-3mm. The learned pixel-to-action mapping no longer corresponds to reality. Symptom: grasp accuracy drops from 90% to 60% over three days. Mitigation: daily automated recalibration using a known target (ArUco marker) in the workspace. Log calibration parameters — drift >2mm/day indicates mechanical issue.
- **Gripper state hallucination on transparent/reflective objects.** Depth estimation fails on glass, reflective metal, black-on-black objects. Symptom: robot crushes an object because VLA perceived it as already grasped. Mitigation: augment vision with gripper motor current sensing. Fuse VLA output with haptic force/torque data for multi-modal state estimation.
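A sketch of the safety filter referenced above, combining the joint-clamp, velocity-ceiling, and keep-out mitigations. The limits are illustrative (Panda-like numbers); real values come from the robot's datasheet and the cell layout, and `forward_kinematics` is a stand-in for your kinematics library:

```python
import numpy as np

# Illustrative Panda-like joint limits; use your robot's datasheet values.
JOINT_MIN = np.radians([-166, -101, -166, -176, -166,  -1, -166])
JOINT_MAX = np.radians([ 166,  101,  166,   -4,  166, 215,  166])
MAX_JOINT_STEP = np.radians(5.0)   # per 5 Hz VLA interval, i.e. ~25 deg/s ceiling
KEEP_OUT_X_MIN = 0.6               # e.g. half-plane shared with another robot (meters)

def safety_filter(proposed, current, forward_kinematics):
    """Hard-constraint layer between VLA output and actuation.

    `proposed` and `current` are 7-DoF joint vectors in radians;
    `forward_kinematics` maps joints -> end-effector xyz.
    """
    # 1. Clamp to the physical joint range (the VLA may command unreachable poses).
    clamped = np.clip(proposed, JOINT_MIN, JOINT_MAX)
    # 2. Enforce the velocity ceiling by capping per-interval joint motion.
    step = np.clip(clamped - current, -MAX_JOINT_STEP, MAX_JOINT_STEP)
    candidate = current + step
    # 3. Reject, don't repair, keep-out violations: hold the current pose instead.
    if forward_kinematics(candidate)[0] > KEEP_OUT_X_MIN:
        return current
    return candidate
```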
Hardware guidance
**Hobbyist: Simulation-only (12GB+ VRAM)**
[RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) runs OpenVLA inference at 3-5 Hz and MuJoCo at real-time. For learning, experimentation, and demonstration collection — not physical control. Fine-tuning with LoRA requires moving to 24GB+ GPU or cloud rental.
**SMB: Single physical robot + inference GPU**
One [RTX 4090](/hardware/rtx-4090) (24GB) runs OpenVLA at 5-8 Hz for pick-and-place at warehouse-adjacent speeds. Same GPU handles LoRA fine-tuning. Paired with a robot controller (Franka, UR) that enforces velocity/force limits. Cost: $1,800 GPU + $20,000-35,000 robot arm. For small-batch manufacturing, lab automation, or research labs with one robot.
**Enterprise: Production manipulation cell**
Dual [RTX 4090](/hardware/rtx-4090) for redundant inference on one arm. For 3-5 arms in a cell, a shared [NVIDIA L40S](/hardware/nvidia-l40s) (48GB) serves multiple VLA instances via [vLLM](/tools/vllm). Each arm retains local safety controllers. This tier is for manufacturing and logistics where a dropped part costs throughput, not safety.
**Frontier: Safety-critical (medical, aerospace)**
Dual [L40S](/hardware/nvidia-l40s) or [H100 PCIe](/hardware/nvidia-h100-pcie) with redundant inference + formal safety verification. VLA runs in a real-time OS for guaranteed deadlines — Linux's non-deterministic scheduler is unacceptable. A safety-certified controller monitors VLA output. Model frozen after validation — no updates without re-certification.
**Edge: Jetson Orin for mobile robots**
[NVIDIA Jetson Orin](/hardware/nvidia-dgx-spark) (32-64GB unified) runs quantized OpenVLA at 2-4 Hz, 30-60ms latency, 15-30W. For drones and autonomous mobile robots where carrying a desktop GPU is infeasible. Adequate for navigation and slow pick-and-place; marginal for dynamic tasks requiring >5 Hz.
Runtime guidance
**Evaluating VLAs in simulation? → Hugging Face Transformers + MuJoCo + robosuite**
Load OpenVLA via [transformers](/tools/transformers): `AutoModelForVision2Seq.from_pretrained("openvla/openvla-7b")`. Standard Prismatic VLM backbone. Pipeline: RGB image 224×224 → SigLIP vision encoder → concatenate with text tokens → Llama decode → parse action tokens. Runs 15-25 tok/s on [RTX 4090](/hardware/rtx-4090) (FP16), 8-12 tok/s on RTX 4070 (8-bit). Combine with MuJoCo + robosuite for the research-standard evaluation pipeline. For benchmarking multiple VLAs (OpenVLA, Octo, RT-2-X), the Hugging Face ecosystem provides a unified interface.
**Deploying OpenVLA on a physical robot? → OpenVLA + robot-specific motion planner**
Same transformers pipeline, but outputs go to the robot controller instead of MuJoCo. Interface: OpenVLA outputs 7-DoF absolute joint positions + gripper state → motion planner interpolates to target at 100 Hz → joint controller executes. Use ROS 2 as middleware: the VLA inference node publishes JointTrajectory messages; the robot controller subscribes and executes. Implement a watchdog: if the VLA node doesn't publish within 1.5x the expected interval, decelerate to zero.
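A minimal rclpy sketch of the subscriber-side watchdog; the topic name, 5 Hz interval, and safe-stop hook are illustrative:

```python
import rclpy
from rclpy.node import Node
from trajectory_msgs.msg import JointTrajectory

EXPECTED_INTERVAL_S = 0.2   # 5 Hz VLA
TIMEOUT_S = 1.5 * EXPECTED_INTERVAL_S

class VLAWatchdog(Node):
    """Decelerate to zero if the VLA node stops publishing waypoints."""

    def __init__(self):
        super().__init__("vla_watchdog")
        self.last_msg_time = self.get_clock().now()
        self.sub = self.create_subscription(
            JointTrajectory, "/vla/joint_trajectory", self.on_waypoint, 10)
        self.timer = self.create_timer(0.02, self.check_deadline)  # 50 Hz check

    def on_waypoint(self, msg: JointTrajectory):
        self.last_msg_time = self.get_clock().now()
        # forward msg to the motion planner here

    def check_deadline(self):
        age = (self.get_clock().now() - self.last_msg_time).nanoseconds * 1e-9
        if age > TIMEOUT_S:
            self.get_logger().warn("VLA waypoint stale: decelerating to zero")
            # hypothetical: trigger the controller's safe-stop here

def main():
    rclpy.init()
    rclpy.spin(VLAWatchdog())

if __name__ == "__main__":
    main()
```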
**Using RT-2-X? → JAX + Google TPU, or community PyTorch port**
RT-2-X is JAX-built and TPU-optimized. For serious deployment, Google Cloud TPU v5e ($2.50/TPU-hour) delivers 20-30 Hz inference. The open-weight release can be loaded in PyTorch via a community port, but performance is suboptimal. RT-2-X remains Google-infrastructure-locked as of 2026. Plan around OpenVLA for local deployment.
**Using Pi0? → Cloud API only**
Not downloadable. Physical Intelligence's managed API provides 50-150ms latency. Per-request pricing is partnership-gated. Pi0 is the capability ceiling but not local. Start with OpenVLA for local control loop reliability; consider Pi0 only if OpenVLA's quality is insufficient.
**Simulation stack for pre-deployment:**
- MuJoCo: Fast, open-source physics, GPU-accelerated, Python bindings
- robosuite: Standardized robot environments (Panda, Sawyer, UR5) with task definitions
- Isaac Sim (NVIDIA): Higher-fidelity rendering for photorealism-driven sim-to-real transfer. Requires RTX GPU
- SAPIEN (Stanford): Articulated object manipulation (doors, drawers, cabinets) — weaker physics but stronger articulation