Capability notes
Open-weight text-to-video is the least mature AI modality — models arrived 12-18 months after text-to-image equivalents and trail closed APIs in temporal consistency, resolution, and prompt adherence. Three models (all on [ComfyUI](/tools/comfyui)) define the landscape: **Wan 2.1** (Alibaba, strongest all-around quality), **HunyuanVideo** (Tencent, highest resolution at 720×1280), **LTX-Video** (Lightricks, fastest at 2-4s per clip).
**Wan 2.1** is the default recommendation. It produces 5-second clips at 480×832 (16fps) with class-leading temporal consistency — minimal flickering, and characters maintain their appearance across frames. The 14B T2V model needs roughly 22-28 GB VRAM at FP16 and 12-16 GB at FP8. The quality gap vs closed-source (Sora, Kling, Runway Gen-3) is roughly 12-18 months — Wan generates short, medium-quality clips; closed APIs generate longer, higher-resolution, more temporally stable video. Prompt adherence runs ~70-80% of closed API benchmarks.
**HunyuanVideo** pushes higher resolution — native 720×1280 at 129 frames (~8.5s at 15fps) from a 13B model. It has a higher quality ceiling than Wan, but needs at least 48 GB VRAM at FP16 for full resolution. Multi-resolution training (256p-1080p) handles aspect ratios naturally. Best open-weight option for vertical video (9:16).
**LTX-Video** prioritizes speed — 5-second clips in 2-4 seconds on an [RTX 4090](/hardware/rtx-4090), 10-20× faster than Wan (30-60s) and Hunyuan (60-120s). Quality is noticeably lower (more flickering, less detail), but the speed enables rapid iteration. Fits in 12 GB at FP8.
**Frame limitations**: All open-weight models exhibit temporal degradation beyond 24-32 frames (1.5-2s). Small errors compound exponentially in autoregressive generation. Full-video-attention architectures (processing all frames simultaneously) are in research but not yet open-weight.
If you just want to try this
Lowest-friction path to a working setup.
Set expectations first: open-weight video generation produces short (5s), medium-resolution (480p-720p), temporally imperfect clips. This isn't Sora quality. The models are free and private, but they demand patience and high-end hardware.
Download [ComfyUI](https://github.com/comfyanonymous/ComfyUI) — the standard runtime for Wan, Hunyuan, and LTX-Video. Install the ComfyUI-WanVideoWrapper custom node via ComfyUI Manager. Download the Wan 2.1 T2V 14B weights from Hugging Face (Wan-AI/Wan2.1-T2V-14B) — FP16 weights are ~27 GB, FP8 ~14 GB with nearly identical quality.
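A minimal download sketch using the `huggingface_hub` Python client. The repo ID is the one named above; the target directory under ComfyUI's `models/` tree is an assumption, so check the WanVideoWrapper README for the layout it expects.

```python
# Sketch: fetch the Wan 2.1 T2V 14B weights with huggingface_hub.
# Assumption: the local_dir below matches your ComfyUI install; adjust it
# to wherever ComfyUI-WanVideoWrapper expects diffusion model weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-T2V-14B",
    local_dir="ComfyUI/models/diffusion_models/Wan2.1-T2V-14B",
)
# FP16 weights are ~27 GB on disk; confirm free space before starting.
```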
You need 24 GB+ VRAM. [RTX 4090 24GB](/hardware/rtx-4090): Wan 2.1 FP8, 5s clip at 480×832 in 30-60s. [RTX 3090 24GB](/hardware/rtx-3090): 45-90s (lower bandwidth: 936 GB/s vs the 4090's 1 TB/s). Start with the simplest workflow: Load Wan Model → Wan Text-to-Video → Video Combine (frames to MP4).
Prompt format matters: "A [subject] [doing action] in [environment], [camera movement], [lighting], [style]." Example: "A woman walking through a bamboo forest, camera tracking left to right, soft morning light, cinematic 24fps." Camera movement + frame rate hints help the model lock onto motion patterns.
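If you script prompt generation, a tiny helper keeps prompts in that shape. This is purely illustrative; the function and field names are not part of any Wan or ComfyUI API.

```python
# Sketch: assemble prompts in the "[subject] [action] in [environment], ..." shape.
# Illustrative only; not a Wan or ComfyUI API.
def build_video_prompt(subject: str, action: str, environment: str,
                       camera: str, lighting: str, style: str) -> str:
    return f"{subject} {action} in {environment}, {camera}, {lighting}, {style}"

print(build_video_prompt(
    "A woman", "walking", "a bamboo forest",
    "camera tracking left to right", "soft morning light", "cinematic 24fps",
))
```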
If 24 GB isn't available, switch to LTX-Video (Lightricks/LTX-Video) via the ComfyUI-LTXVideo node. It fits in 12 GB at FP8 and generates in 2-4s, though quality is noticeably lower. For CPU-only or [Apple Silicon](/hardware/macbook-pro-16-m4-max): video generation isn't practical — even LTX-Video takes 10-30 minutes per clip on an M-series GPU. Rent a cloud GPU if you don't own suitable hardware.
For production deployment
Operator-grade recommendation.
Production text-to-video with open-weight models is early-stage. The quality/cost/speed tradeoff currently favors closed APIs for most commercial use. Self-hosted is for (1) privacy-sensitive content, (2) high-volume where API costs dominate hardware, (3) custom fine-tuning on proprietary styles.
**Batch pipeline**: [ComfyUI](/tools/comfyui) headless as video generation API. Architecture: request queue (Redis) → ComfyUI worker (one per GPU) → frame output → FFmpeg encoding → cloud storage → callback. Single [RTX 5090 32GB](/hardware/rtx-5090): ~60 Wan clips/hour. For 10,000 clips/month: ~167 GPU-hours. At $3-5/hour cloud: $500-835/month vs Kling/Runway API at $0.05-0.20/clip ($500-2,000/month) — cost-competitive.
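A minimal sketch of the queue-to-worker leg, assuming a Redis list as the request queue, one worker process pinned per GPU, and ComfyUI running headless on port 8188. The queue names, job fields, and the way parameters are injected into the workflow are all assumptions; real ComfyUI workflows are node-ID-keyed JSON.

```python
# Sketch: Redis-backed worker that feeds jobs to a headless ComfyUI instance.
# Assumptions: a Redis list "video:requests" holds JSON jobs and one worker
# process is pinned to one GPU. Parameter injection below is a placeholder;
# real ComfyUI workflows are JSON keyed by node IDs.
import json
import redis
import requests

r = redis.Redis(host="localhost", port=6379)
COMFY_URL = "http://127.0.0.1:8188"

def run_worker(workflow_template: dict) -> None:
    while True:
        _, raw = r.blpop("video:requests")       # block until a job arrives
        job = json.loads(raw)
        workflow = dict(workflow_template)
        workflow["prompt_text"] = job["prompt"]  # placeholder parameter injection
        workflow["seed"] = job.get("seed", 0)
        resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})
        resp.raise_for_status()
        r.rpush("video:results", json.dumps({
            "job_id": job["id"],
            "prompt_id": resp.json()["prompt_id"],  # ComfyUI's queue identifier
        }))
```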
**When closed APIs beat self-hosted**: (1) Quality — Sora, Kling 2.0, Runway Gen-3 produce visibly better video at higher resolution, longer duration, better temporal consistency. (2) Speed — Kling: 10-20s per clip vs Wan's 30-60s. (3) Resolution — 1080p default vs 480p requiring upscaling. (4) Support — frame interpolation, upscaling, format conversion included.
**When self-hosted wins**: (1) Privacy — proprietary footage where cloud upload violates data policy. (2) Volume — 100,000+ clips/month, hardware amortization favors self-hosting. (3) Customization — LoRA fine-tuning on proprietary visual styles possible with open-weight, impossible with closed APIs.
**Pipeline architecture**: (1) Prompt validation/enrichment → (2) [ComfyUI](/tools/comfyui) generation on GPU cluster → (3) Frame interpolation (RIFE) 16→30fps → (4) AI upscaling (Real-ESRGAN) 480p→1080p → (5) Quality gate (CLIP score, aesthetic scoring, NSFW detection) → (6) FFmpeg encode (H.264/H.265) → delivery. Cost per clip (Wan, [RTX 5090](/hardware/rtx-5090)): $0.05-0.10 GPU compute, $0.01-0.02 post-processing, $0.0001-0.001 storage ≈ $0.06-0.12 total self-hosted vs $0.05-0.20 API.
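The per-clip arithmetic above can be packaged as a quick estimator. A minimal sketch; the default post-processing and storage costs are midpoints of the ranges quoted above, and the rates are assumptions you should replace with your own.

```python
# Sketch: per-clip self-hosted cost from the figures quoted above.
def clip_cost(gpu_rate_per_hour: float, seconds_per_clip: float,
              post_processing: float = 0.015, storage: float = 0.0005) -> float:
    gpu = gpu_rate_per_hour * seconds_per_clip / 3600  # GPU time per clip
    return gpu + post_processing + storage

# RTX 5090 rented at $4/hour, 60 s per Wan clip:
# ~$0.067 GPU + ~$0.016 post/storage ≈ $0.08 per clip (inside the $0.06-0.12 range)
print(round(clip_cost(gpu_rate_per_hour=4.0, seconds_per_clip=60), 3))
```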
What breaks
Failure modes operators see in the wild.
- **Temporal inconsistency (flickering, morphing).** Small texture/color shifts accumulate across frames — objects "swim," backgrounds shift. Mitigation: generate at higher fps with temporal smoothing across 3-5 frame windows. Use frame interpolation (RIFE) between generated frames. Accept this as the open-weight video generation tax.
- **Motion collapse (static-looking video).** Model generates near-static frames — minimal motion despite dynamic prompt. Mitigation: include explicit motion descriptors ("camera pans left," "subject walks forward," "particles drift upward"). Increase guidance scale (7-10). Wan handles motion better than LTX-Video.
- **Resolution/frame budget OOM.** Requesting 129 frames at 720×1280 on a 24 GB GPU crashes mid-generation — the model loads, allocates frame buffers, then runs out of VRAM around frame 50-80. Mitigation: pre-flight VRAM calculation (see the sketch after this list). HunyuanVideo 720p: model ~26 GB + (129 frames × ~50 MB) ≈ 32 GB. Fall back to 512p or fewer frames.
- **Prompt adherence weaker than image gen.** The model ignores secondary prompt elements as the frame count grows — "cat on red couch with window" becomes a cat on a beige couch with no window by frame 30. Mitigation: keep prompts simple (1-2 subjects, 1 action, 1 environment). Use I2V mode with an established start frame and prompt only the motion.
- **NSFW filter false-positives.** Wan and Hunyuan ship with safety filters trained to Chinese content standards, so expect false positives on medical content, fitness content, swimwear, and classical art. Mitigation: no clean workaround — the filter is baked into the model weights. Community fine-tunes may loosen it but sit in a legal gray area.
- **Multi-character identity confusion.** Two distinct characters swap attributes by mid-clip — model tracks one identity but struggles with two. Mitigation: generate characters separately and composite. Use I2V with initial frame showing both. Active research, no reliable open-weight solution.
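A back-of-envelope pre-flight check mirroring the OOM arithmetic above. The model footprints and the ~50 MB-per-frame buffer are the rough figures used in this section, not measured values; treat the headroom constant as an assumption.

```python
# Sketch: rough VRAM pre-flight before queuing a generation.
# Footprints and per-frame buffer are the rough figures from this section.
MODEL_FOOTPRINT_GB = {
    "hunyuanvideo_fp16": 26.0,  # ~26 GB weights at FP16
    "wan21_fp8": 14.0,          # ~14 GB weights at FP8
}
FRAME_BUFFER_GB = 0.05          # ~50 MB per 720p frame

def fits(model: str, frames: int, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    needed = MODEL_FOOTPRINT_GB[model] + frames * FRAME_BUFFER_GB + headroom_gb
    return needed <= vram_gb

# HunyuanVideo, 129 frames at 720p: 26 + 6.45 + 2 ≈ 34 GB, so it will not fit in 24 GB.
print(fits("hunyuanvideo_fp16", 129, 24.0))  # False -> drop to 512p or fewer frames
```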
Hardware guidance
**Minimum viable (24 GB VRAM)**: [RTX 3090 24GB](/hardware/rtx-3090) or [RTX 4090 24GB](/hardware/rtx-4090). Wan 2.1 FP8 at 480×832, 5-second clips. This is the absolute floor — below 24 GB, video generation is not practical. Expect 30-90 seconds per clip on the 4090, 45-120 seconds on the 3090 (its lower memory bandwidth, 936 GB/s vs the 4090's 1 TB/s, hurts video throughput more than image generation). [Apple M3 Ultra 64GB](/hardware/apple-m3-ultra) via MLX is 2-4× slower, but unified memory enables higher frame counts or resolution if you accept the speed penalty.
**Recommended (32 GB)**: [RTX 5090 32GB](/hardware/rtx-5090). Single-card sweet spot — Wan 2.1 FP16 (~27 GB model) with full 5s generation. Also HunyuanVideo at 512×720 reduced resolution. 30-60s per Wan clip, 60-90s per Hunyuan.
**Enterprise (48 GB+)**: [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB or [RTX A6000](/hardware/rtx-a6000) 48 GB. HunyuanVideo 720×1280, 129 frames at FP16. Wan 2.1 FP16 extended 81-frame generation. [NVIDIA H200](/hardware/nvidia-h200) 141 GB for experimental 200+ frame generation (10+ seconds).
**Multi-GPU**: Video gen models don't natively support tensor parallelism — you can't split one generation across GPUs. Instead, run multiple ComfyUI workers, one per GPU. 4× [RTX 5090](/hardware/rtx-5090) = 4 concurrent generations, ~240 clips/hour. 8× [L40S](/hardware/nvidia-l40s) = 8 concurrent, ~480 clips/hour.
**VRAM scaling**:

| Model | Precision | Resolution / length | VRAM |
|---|---|---|---|
| LTX-Video | FP8 | 480p | 10-12 GB |
| Wan 2.1 | FP8 | 480p, 5s | 12-16 GB |
| Wan 2.1 | FP16 | 480p, 5s | 22-28 GB |
| HunyuanVideo | FP16 | 512p | 26-32 GB |
| HunyuanVideo | FP16 | 720p, 129 frames | 42-50 GB |
Runtime guidance
**If generating video clips interactively** → [ComfyUI](https://github.com/comfyanonymous/ComfyUI) with the WanVideoWrapper, HunyuanVideoWrapper, or LTXVideo nodes. This is the standard and only practical runtime for open-weight video generation — Wan, Hunyuan, and LTX-Video all ship ComfyUI workflow support as their primary integration. Native Diffusers support is experimental; use ComfyUI for reliability.
**If building programmatic video generation** → ComfyUI API mode (`--listen`, POST workflows to `/prompt`). The API accepts parameterizable workflow JSON — inject prompts, seeds, and model paths. Build a Python wrapper: construct workflow JSON → submit → poll → download the video output (a minimal sketch follows).
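A minimal submit-and-poll sketch against ComfyUI's API mode. The `/prompt` and `/history/<id>` endpoints are ComfyUI's own; the workflow JSON itself (node IDs, input names) depends on the workflow you export and is assumed to be supplied by the caller.

```python
# Sketch: submit an API-format workflow to headless ComfyUI and poll for completion.
# Assumes ComfyUI was started with --listen and is reachable on port 8188.
import time
import requests

COMFY = "http://127.0.0.1:8188"

def generate(workflow: dict, timeout_s: int = 600) -> dict:
    resp = requests.post(f"{COMFY}/prompt", json={"prompt": workflow})
    resp.raise_for_status()
    prompt_id = resp.json()["prompt_id"]
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        history = requests.get(f"{COMFY}/history/{prompt_id}").json()
        if prompt_id in history:                  # present once execution finished
            return history[prompt_id]["outputs"]  # node outputs incl. saved video paths
        time.sleep(2)
    raise TimeoutError(f"ComfyUI did not finish {prompt_id} within {timeout_s}s")
```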
**If speed is priority** → LTX-Video on [RTX 4090](/hardware/rtx-4090) or [RTX 5090](/hardware/rtx-5090). 2-4s per clip. Lowest quality — use for real-time previews, iterative prompt refinement to find best seed, then commit to Wan 2.1 for final quality.
**If quality is priority** → Wan 2.1 FP16 + RIFE interpolation + Real-ESRGAN upscaling. Pipeline: Wan at 16fps → RIFE to 30fps → Real-ESRGAN 480p→1080p. 90-180s per clip on [RTX 5090](/hardware/rtx-5090). Best open-weight video quality achievable.
**If needing image-to-video for controlled generation** → Wan 2.1 I2V model (Wan2.1-I2V-14B) in ComfyUI. Provide start frame + motion prompt. I2V produces more controllable results — start frame anchors the scene, model only handles temporal evolution.
**Post-processing stack**: RIFE (frame interpolation, 5-15s) for motion smoothing. Real-ESRGAN (upscaling, 10-30s). FFmpeg encoding (H.264/H.265, CRF 18-23) in ComfyUI Video Combine node.
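A minimal sketch of the final encode step, calling FFmpeg from Python with a CRF inside the 18-23 range above. The frame pattern, output name, and 30 fps rate (post-RIFE) are assumptions; ComfyUI's Video Combine node normally handles this step for you.

```python
# Sketch: encode interpolated/upscaled frames to H.264 with FFmpeg.
# Paths and frame rate are assumptions; CRF 20 sits in the 18-23 range above.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "30",         # post-RIFE frame rate
    "-i", "frames/%05d.png",    # upscaled frames from Real-ESRGAN
    "-c:v", "libx264",
    "-crf", "20",
    "-pix_fmt", "yuv420p",      # broad player compatibility
    "clip_1080p.mp4",
], check=True)
```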
**What doesn't work**: [Diffusers](/tools/diffusers) native video pipelines are experimental and unreliable for Wan/Hunyuan. [Transformers](/tools/transformers), [Ollama](/tools/ollama), [LM Studio](/tools/lm-studio), [Automatic1111](/tools/automatic1111) — none support video generation. ComfyUI is the single practical runtime.
Hardware buying guidance for Text-to-Video Generation
Local video generation is the most VRAM-hungry workload of 2026 — Hunyuan, Wan, and Mochi all need 24 GB minimum, with 32 GB unlocking longer clips.