Scientific
embodied ai
vla
robot learning

Robotics

Vision-language-action models for robotics. RT-2, Open X-Embodiment, RDT-1B, Pi0.

Capability notes

Vision-Language-Action (VLA) models — models that take visual input and language instructions and output robot actions — are the frontier for AI-driven robot control. In 2026, three families dominate: **OpenVLA** (open-weight, fine-tunable), **RT-2-X** (Google research, limited open release), and **Pi0** (Physical Intelligence, API-gated). Octo (UC Berkeley) is the smaller open-weight alternative.

**OpenVLA** is a 7B-parameter VLA on a Prismatic VLM backbone (fused SigLIP + DINOv2 vision encoder, Llama 2 language model) fine-tuned on the Open X-Embodiment dataset — 1M+ real trajectories spanning 22 robot embodiments and 60+ datasets. It outputs continuous action chunks (7-DoF arm positions, gripper states) at 5-10 Hz. On [RTX 4090](/hardware/rtx-4090) via [transformers](/tools/transformers): 15-25 tok/s, translating to a 3-5 Hz control loop once image encoding is included. Fine-tuning on a specific robot requires 50-200 demonstrations for 70-85% task completion on pick-and-place. Generalization to novel objects is 60-70% zero-shot, dropping to 30-40% in novel environments.

**RT-2-X** (55B full, 15B open variant) outperforms OpenVLA on generalization (75-85% on novel objects) but requires 36-40GB VRAM and is built in JAX — porting to PyTorch for non-Google hardware is non-trivial. **Pi0** claims 90%+ success on manipulation benchmarks but is API-gated — no local deployment.

**Fine-tuning is mandatory.** No VLA generalizes out-of-the-box to a novel robot. The sim-to-real gap is 30-50% — models trained in simulation fail on physical hardware due to visual texture differences, lighting changes, camera calibration drift, and physics mismatches. Operators budget 2-6 weeks for fine-tuning data collection (50-200 demonstrations per task per robot).
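To make the interface concrete, here is an illustrative sketch of a single perception-action cycle. The `policy` callable, the 7-joints-plus-gripper layout, and the ~3 Hz floor are assumptions that follow the numbers above — this is not any model's actual API.

```python
# Illustrative only: the shape of one VLA control step (not OpenVLA's API).
import time
from dataclasses import dataclass

import numpy as np

@dataclass
class ActionChunk:
    joints: np.ndarray   # (7,) arm joint position targets
    gripper: float       # 0.0 = open, 1.0 = closed

def control_step(policy, image: np.ndarray, instruction: str) -> ActionChunk:
    """One perception-action cycle: encode image + instruction, decode an action."""
    t0 = time.perf_counter()
    out = policy(image, instruction)          # vision encode + LLM decode
    latency = time.perf_counter() - t0
    if latency > 1 / 3:                       # ~3 Hz floor implied by 15-25 tok/s
        print(f"warning: step took {latency * 1e3:.0f} ms -- below 3 Hz")
    return ActionChunk(joints=np.asarray(out[:7], dtype=float), gripper=float(out[7]))

# Dummy usage: a stand-in policy returning a zero action vector.
dummy_policy = lambda img, txt: np.zeros(8)
chunk = control_step(dummy_policy, np.zeros((224, 224, 3), dtype=np.uint8), "pick up the cube")
```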

If you just want to try this

Lowest-friction path to a working setup.

There is no beginner path to safe, working on-device robot AI. Start in simulation (MuJoCo + OpenVLA) before touching hardware. Budget 2-4 weeks for a working simulation pipeline.

  1. Install MuJoCo (Google DeepMind's free physics simulator) and robosuite (robot simulation framework). These run on any laptop with a GPU; an [RTX 3060 12GB](/hardware/rtx-3060-12gb) is adequate for single-robot simulation.
  2. Load OpenVLA via Hugging Face [transformers](/tools/transformers): `openvla/openvla-7b`. Requires ~14GB VRAM — [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) minimum. FP16 is the default; Q8 quantization via bitsandbytes reduces this to 8GB with marginal quality loss.
  3. Set up a simulated Franka Panda arm in robosuite with a pick-and-place task. Run OpenVLA in a perception-action loop (sketched after these steps): capture simulated frame → vision encoder → generate action chunk → execute in MuJoCo → repeat. This runs at 3-5 Hz in simulation.
  4. Evaluate. Expect 40-60% success on simple pick-and-place zero-shot. Run 100 trials for statistical significance. If below 50%, fine-tuning is required.
  5. Fine-tune with LoRA on 50-100 simulated demonstrations collected via keyboard/mouse teleoperation. LoRA requires ~20GB VRAM (rank=16). Training: 4-8 hours on [RTX 4090](/hardware/rtx-4090). Full fine-tuning needs 40GB+ VRAM.

Only when sim success exceeds 80% should you consider physical deployment — and expect sim-to-real transfer to drop that to 50-60% on first hardware contact. Physical deployment requires safety infrastructure (e-stop, velocity limits, collision detection) that simulation does not enforce.
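A minimal sketch of step 3's loop using robosuite's standard API. The `vla_predict` and `vla_to_env_action` functions below are placeholder stubs you must replace with your loaded model (step 2) and an action-space mapping — they are not robosuite or OpenVLA functions — and the `Lift` task and camera settings are just one reasonable choice.

```python
# Perception-action loop sketch for step 3. Placeholder functions are marked;
# replace them with your loaded VLA and a mapping to this env's action space.
import numpy as np
import robosuite as suite

def vla_predict(image, instruction):
    """Placeholder: call the VLA loaded in step 2 here."""
    return np.zeros(8)                        # 7 joint targets + gripper (dummy)

def vla_to_env_action(raw_action, env):
    """Placeholder: map the VLA output to this environment's action space."""
    return np.zeros(env.action_dim)           # no-op action (dummy)

env = suite.make(
    env_name="Lift",                # single-cube pick-up task
    robots="Panda",
    has_renderer=False,
    has_offscreen_renderer=True,
    use_camera_obs=True,
    camera_names="agentview",
    camera_heights=224,             # match the VLA's vision input resolution
    camera_widths=224,
    control_freq=5,                 # ~5 Hz, in line with VLA inference speed
)

instruction = "pick up the red cube"
obs = env.reset()
done = False
while not done:
    frame = obs["agentview_image"]                    # (224, 224, 3) RGB frame
    raw_action = vla_predict(frame, instruction)      # VLA inference
    obs, reward, done, info = env.step(vla_to_env_action(raw_action, env))
env.close()
```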

For production deployment

Operator-grade recommendation.

Production robot AI is safety-critical engineering. A VLA controlling physical hardware has failure modes that damage equipment and injure people.

**Safety-critical inference.** The minimum safe architecture is **dual inference with disagreement detection**: run the same VLA on two independent GPUs, compare the action vectors, and execute only when the difference is below a calibrated threshold (<5% of joint range for position, <10% for velocity). On divergence, fall back to a conservative hard-coded policy (stop or hold). Two physically separate GPUs are required — two virtual instances share failure modes. For one robot arm: dual [RTX 4070 Ti](/hardware/rtx-4070-ti) (~$1,600 total) or dual [RTX 4090](/hardware/rtx-4090) (~$3,600).

**Latency requirements.** The control loop must meet the robot's minimum frequency. For a Franka Panda (1 kHz internal, 100 Hz external), the VLA at 5 Hz generates waypoints; a motion planner interpolates at 100 Hz. Gripper control at 5 Hz is adequate. Latency targets: image encoding <50ms, LLM inference <150ms, action decoding <20ms — total <220ms. If any component exceeds its budget, hold the current waypoint (safe). Never extrapolate stale waypoints (dangerous).

**Edge vs cloud.**
- Edge (local GPU): fixed cost, deterministic latency (std dev <10ms), works during network outages. Cost: $2,000-10,000 one-time per robot. Default for safety-critical work.
- Cloud API: variable cost, variable latency (30-300ms round-trip), internet-dependent. For non-safety-critical tasks (inventory scanning, monitoring) where stalls are acceptable.

**Fine-tuning for deployment.** Collect 100-200 demonstrations on the specific physical hardware (not sim). Each demonstration is teleoperated, producing (image, joint_positions, gripper_state, task_text) tuples. LoRA fine-tune for 5-10 epochs. Production threshold: >90% task completion on validation, <5% action variance in repeated scenarios. Below threshold: collect more demonstrations and retrain.

**Calibration maintenance.** The VLA's vision encoder assumes fixed camera extrinsics. A bumped camera degrades predictions — not slightly wrong everywhere, but suddenly wrong in particular workspace regions. Daily calibration: execute 5 pre-recorded trajectories and compare joint encoders to expected values. Deviation >0.5° triggers recalibration before VLA operation. This is a maintenance burden non-robotics ML engineers systematically underestimate.
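A minimal sketch of the dual-inference check described above, assuming two already-loaded VLA instances pinned to physically separate GPUs. The 5% threshold follows this section; the function names and the averaged-action choice are illustrative assumptions, not a library API.

```python
# Sketch of dual inference with disagreement detection (illustrative, not a library API).
import numpy as np

POSITION_TOL = 0.05   # execute only if per-joint disagreement < 5% of joint range

def actions_agree(a: np.ndarray, b: np.ndarray, joint_range: np.ndarray) -> bool:
    """True if every joint command differs by less than 5% of that joint's range."""
    return bool(np.all(np.abs(a - b) < POSITION_TOL * joint_range))

def safe_step(vla_gpu0, vla_gpu1, image, instruction, joint_range, hold_pose):
    action_a = vla_gpu0(image, instruction)    # inference on GPU 0
    action_b = vla_gpu1(image, instruction)    # inference on GPU 1
    if actions_agree(action_a, action_b, joint_range):
        return 0.5 * (action_a + action_b)     # agreement: execute (here, the average)
    return hold_pose                           # divergence: conservative fallback
```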

What breaks

Failure modes operators see in the wild.

- **Sim-to-real distribution shift causing dangerous actions.** Models trained in simulation encounter different textures, lighting, and dynamics on hardware. Symptom: 85% success in sim, 35% on the physical robot — collisions, overshoot, excessive force. Mitigation: domain randomization during sim training (randomize lighting, textures, camera position ±5%). Collect 20%+ of demonstrations on physical hardware. Implement force/torque limits at the controller level that override VLA outputs.
- **Latency spike causing control loop failure.** A missed VLA inference delays the control output by 100-200ms. If the motion planner interpolates on stale waypoints, the robot moves toward an outdated target. Symptom: overshoots the grasp, knocks over the object. Mitigation: decelerate to zero if the next waypoint is not received within 2x the expected interval. Discard waypoints older than 30ms.
- **VLA hallucinating impossible joint configurations.** The model learns joint limits from training data — if edge cases are absent, it commands unreachable positions. Symptom: robot triggers an emergency stop attempting motion through joint limits. Mitigation: clamp VLA outputs to the joints' physical range (see the sketch after this list). Verify reachability with forward/inverse kinematics. Fall back to a safe pose if unreachable.
- **Safety constraint violation in generated trajectories.** VLAs imitate human demonstrations, including operator violations (moving fast near fragile objects, entering another robot's workspace). Symptom: after 200 deployments, the robot executes a high-velocity motion near a human. Mitigation: a safety filter layer between VLA and actuation enforces hard constraints (velocity ceiling, keep-out zones, force limits). The VLA proposes; the safety filter approves or rejects.
- **Calibration drift.** Thermal expansion or accidental bumping shifts the camera by 1-3mm. The learned pixel-to-action mapping no longer corresponds to reality. Symptom: grasp accuracy drops from 90% to 60% over three days. Mitigation: daily automated recalibration using a known target (ArUco marker) in the workspace. Log calibration parameters — drift >2mm/day indicates a mechanical issue.
- **Gripper state hallucination on transparent/reflective objects.** Depth estimation fails on glass, reflective metal, and black-on-black objects. Symptom: robot crushes an object because the VLA perceived it as already grasped. Mitigation: augment vision with gripper motor current sensing. Fuse VLA output with haptic force/torque data for multi-modal state estimation.
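A minimal safety-filter sketch covering the joint-limit clamp and the 30ms stale-waypoint rule above. The limit values are placeholders to be replaced by your robot's published limits, and the function is illustrative rather than part of any controller API.

```python
# Safety filter sketch between VLA output and the motion planner (illustrative).
import time
import numpy as np

# Placeholder 7-DoF joint limits -- substitute your robot's published values.
JOINT_MIN = np.array([-2.9, -1.8, -2.9, -3.1, -2.9, -0.1, -2.9])
JOINT_MAX = np.array([ 2.9,  1.8,  2.9,  0.0,  2.9,  3.8,  2.9])
MAX_WAYPOINT_AGE_S = 0.030           # discard waypoints older than 30 ms

def filter_waypoint(joints: np.ndarray, stamp: float):
    """Clamp commanded joints to physical limits; reject stale commands entirely."""
    if time.time() - stamp > MAX_WAYPOINT_AGE_S:
        return None                  # stale: caller holds the current pose instead
    return np.clip(joints, JOINT_MIN, JOINT_MAX)
```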

Hardware guidance

**Hobbyist: Simulation-only (12GB+ VRAM).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) runs OpenVLA inference at 3-5 Hz and MuJoCo at real-time. For learning, experimentation, and demonstration collection — not physical control. Fine-tuning with LoRA requires moving to a 24GB+ GPU or cloud rental.

**SMB: Single physical robot + inference GPU.** One [RTX 4090](/hardware/rtx-4090) (24GB) runs OpenVLA at 5-8 Hz for pick-and-place at warehouse-adjacent speeds. The same GPU handles LoRA fine-tuning. Pair it with a robot controller (Franka, UR) that enforces velocity/force limits. Cost: $1,800 GPU + $20,000-35,000 robot arm. For small-batch manufacturing, lab automation, or research labs with one robot.

**Enterprise: Production manipulation cell.** Dual [RTX 4090](/hardware/rtx-4090) for redundant inference on one arm. For 3-5 arms in a cell, a shared [NVIDIA L40S](/hardware/nvidia-l40s) (48GB) serves multiple VLA instances via [vLLM](/tools/vllm). Each arm retains local safety controllers. This tier is for manufacturing and logistics where a dropped part costs throughput, not safety.

**Frontier: Safety-critical (medical, aerospace).** Dual [L40S](/hardware/nvidia-l40s) or [H100 PCIe](/hardware/nvidia-h100-pcie) with redundant inference plus formal safety verification. The VLA runs in a real-time OS for guaranteed deadlines — Linux's non-deterministic scheduler is unacceptable. A safety-certified controller monitors VLA output. The model is frozen after validation — no updates without re-certification.

**Edge: Jetson Orin for mobile robots.** [NVIDIA Jetson Orin](/hardware/nvidia-dgx-spark) (32-64GB unified) runs quantized OpenVLA at 2-4 Hz, 30-60ms latency, 15-30W. For drones and autonomous mobile robots where carrying a desktop GPU is infeasible. Adequate for navigation and slow pick-and-place; marginal for dynamic tasks requiring >5 Hz.

Runtime guidance

**Evaluating VLAs in simulation? → Hugging Face Transformers + MuJoCo + robosuite.** Load OpenVLA via [transformers](/tools/transformers): `AutoModelForVision2Seq.from_pretrained("openvla/openvla-7b")`. Standard Prismatic VLM backbone. Pipeline: 224×224 RGB image → SigLIP vision encoder → concatenate with text tokens → Llama decode → parse action tokens. Runs 15-25 tok/s on [RTX 4090](/hardware/rtx-4090) (FP16), 8-12 tok/s on an RTX 4070 (Q8). Combine with MuJoCo + robosuite for the research-standard evaluation pipeline. For benchmarking multiple VLAs (OpenVLA, Octo, RT-2-X), the Hugging Face ecosystem provides a unified interface. A loading sketch appears at the end of this section.

**Deploying OpenVLA on a physical robot? → OpenVLA + robot-specific motion planner.** Same transformers pipeline, but outputs go to the robot controller instead of MuJoCo. Interface: OpenVLA outputs 7-DoF absolute joint positions + gripper state → motion planner interpolates to the target at 100 Hz → joint controller executes. Use ROS 2 as middleware: the VLA inference node publishes JointTrajectory messages; the robot controller subscribes and executes. Implement a watchdog: if the VLA node doesn't publish within 1.5x the expected interval, decelerate to zero.

**Using RT-2-X? → JAX + Google TPU, or a community PyTorch port.** RT-2-X is JAX-built and TPU-optimized. For serious deployment, Google Cloud TPU v5e ($2.50/TPU-hour) delivers 20-30 Hz inference. The open-weight release can be loaded in PyTorch via a community port, but performance is suboptimal. RT-2-X is Google-infrastructure-locked as of 2026. Plan around OpenVLA for local deployment.

**Using Pi0? → Cloud API only.** Not downloadable. Physical Intelligence's managed API provides 50-150ms latency. Per-request pricing is partnership-gated. Pi0 is the capability ceiling but not local. Start with OpenVLA for local control-loop reliability; consider Pi0 only if OpenVLA's quality is insufficient.

**Simulation stack for pre-deployment:**
- MuJoCo: fast, open-source physics, GPU-accelerated, Python bindings
- robosuite: standardized robot environments (Panda, Sawyer, UR5) with task definitions
- Isaac Sim (NVIDIA): higher-fidelity rendering for photorealism-driven sim-to-real transfer; requires an RTX GPU
- SAPIEN (Stanford): articulated object manipulation (doors, drawers, cabinets) — weaker physics but stronger articulation support
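A loading-and-inference sketch following the pattern above and the style of OpenVLA's model card. The prompt template and `unnorm_key` value are checkpoint-dependent assumptions — verify them against the checkpoint you actually pull.

```python
# OpenVLA via transformers: a sketch following the model card's documented pattern.
# The prompt template and unnorm_key are checkpoint-dependent assumptions.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,     # ~14 GB of weights; quantize via bitsandbytes for ~8 GB
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("frame.png")     # current camera frame (RGB)
prompt = "In: What action should the robot take to pick up the red cube?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)   # action vector (arm + gripper); layout is checkpoint-specific
```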

Setup walkthrough

  1. Install NVIDIA Isaac Sim (developer.nvidia.com/isaac-sim) — free, requires RTX GPU, ~20 GB download.
  2. Isaac Sim provides physics-accurate robot simulation + Isaac Lab for RL training.
  3. For imitation learning / VLA models: pip install lerobot (HuggingFace LeRobot — open-source robot learning).
  4. LeRobot comes with pre-trained ACT and Diffusion Policy models for common manipulation tasks. Connect a supported robot arm (Koch v1, SO-100, Aloha) via USB.
  5. Record 50-100 demonstrations of a task (e.g., "pick up cube and place in bin") using teleoperation (the per-step data layout is sketched after this list).
  6. Train: `python lerobot/scripts/train.py policy=act env=aloha` — training takes 2-8 hours on an RTX 3090.
  7. Deploy: `python lerobot/scripts/eval.py` — the robot executes the learned policy. Expect your first successful pick-and-place after ~1 day of setup + training.
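For step 5, each recorded timestep should end up as the (image, joint_positions, gripper_state, task_text) tuple described in the production section. LeRobot ships its own recording scripts, so the sketch below only illustrates the data layout; the three `read_*` helpers are placeholders for your camera and robot-state drivers, not LeRobot APIs.

```python
# Illustrative sketch of per-step demonstration data (not LeRobot's recording API).
import time
import numpy as np

def read_camera():
    """Placeholder: replace with your camera driver (e.g. a RealSense capture)."""
    return np.zeros((224, 224, 3), dtype=np.uint8)

def read_joint_positions():
    """Placeholder: replace with your robot's joint-state interface."""
    return np.zeros(7, dtype=np.float32)

def read_gripper_state():
    """Placeholder: replace with your gripper's state interface."""
    return 0.0

def record_demonstration(task_text: str, duration_s: float = 20.0, hz: float = 10.0):
    steps = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        steps.append({
            "image": read_camera(),                     # (H, W, 3) uint8 RGB
            "joint_positions": read_joint_positions(),  # (7,) joint angles
            "gripper_state": read_gripper_state(),      # scalar open/close state
            "task_text": task_text,                     # e.g. "pick up cube and place in bin"
            "timestamp": time.time(),
        })
        time.sleep(1.0 / hz)
    np.save(f"demo_{int(time.time())}.npy", np.array(steps, dtype=object), allow_pickle=True)
    return steps

record_demonstration("pick up cube and place in bin")
```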

The cheap setup

Honestly: $300 cannot do meaningful local robotics AI. The compute is achievable (a used GTX 1060 6 GB trains simple behavior-cloning policies), but the robot hardware is the real cost. A minimal robot arm (Koch v1 kit, SO-100) is $200-500. Sensors, cameras, and a power supply add $200-300. Total minimum robotics setup: $600-800. For simulation-only research (no physical robot): $300 gets a used GTX 1060 6 GB ($60) + a refurbished PC (~$200) — enough for MuJoCo + robosuite and simple RL environments, but not Isaac Sim, which requires an RTX GPU (see the walkthrough above). Either way, you won't deploy to a physical robot at this budget.

The serious setup

A used [RTX 3090 24 GB](/hardware/rtx-3090) ($700-900) runs Isaac Sim at high settings for realistic sensor simulation and trains ACT/Diffusion Policy on 100 demonstrations in 2-4 hours. It can also run OpenVLA 7B for language-conditioned manipulation (RT-2-X's 36-40 GB requirement does not fit in 24 GB). Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Compute total: ~$1,800-2,200. Physical robot: SO-100 arm ($400), Intel RealSense D435 camera ($300), power supply + mounting ($200). Full setup: $2,700-3,100. Add a [Jetson Orin AGX](/hardware/jetson-ai) ($2,000) for on-robot inference.

Common beginner mistake

**The mistake:** training an RL policy entirely in simulation and expecting it to work on the physical robot without any sim-to-real transfer effort.

**Why it fails:** simulators simplify physics — perfect friction, zero latency, ideal lighting, no joint backlash. The policy overfits to simulation artifacts (the "sim-to-real gap") and fails catastrophically on the real robot.

**The fix:** use domain randomization during sim training (vary lighting, friction, object mass, camera position). Collect 10-20 real-world demonstrations and fine-tune the sim-trained policy on them. The hybrid approach (sim pre-training + real fine-tuning) cuts real-world data needs by roughly 90% while maintaining transfer quality. Physical robots break things — start in simulation, but always plan for sim-to-real adaptation.
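A minimal domain-randomization sketch at the MuJoCo model level. The field names (`geom_friction`, `body_mass`, `light_pos`, `cam_pos`) are MuJoCo's, but the scene file and the ±15-20% ranges are illustrative assumptions to tune per task.

```python
# Domain randomization sketch: perturb physics and visuals between training episodes.
# Ranges are illustrative; "franka_scene.xml" is a placeholder for your own scene.
import mujoco
import numpy as np

def randomize(model: mujoco.MjModel, rng: np.random.Generator) -> None:
    # Sliding friction: scale every geom's friction coefficient by +/-20%
    model.geom_friction[:, 0] *= rng.uniform(0.8, 1.2, size=model.ngeom)
    # Object mass: scale body masses by +/-15% (scale inertias to match if needed)
    model.body_mass[:] *= rng.uniform(0.85, 1.15, size=model.nbody)
    # Lighting and camera: jitter positions by a few centimeters
    model.light_pos[:] += rng.uniform(-0.05, 0.05, size=model.light_pos.shape)
    model.cam_pos[:] += rng.uniform(-0.05, 0.05, size=model.cam_pos.shape)

rng = np.random.default_rng(0)
model = mujoco.MjModel.from_xml_path("franka_scene.xml")   # placeholder scene
randomize(model, rng)
data = mujoco.MjData(model)   # then roll out the episode with the perturbed model
```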



