Avatar Generation

Talking-head avatar video generation from audio + a reference image. SadTalker and Hallo are open-weight options; EMO (no publicly released weights) and AnimateDiff (general motion, not audio-driven lip sync) are often mentioned alongside them but don't fill the same role.

Setup walkthrough

  1. Install ComfyUI via Stability Matrix.
  2. ComfyUI Manager → Install Models → "sadtalker" (2 GB — face animation from audio) or "hallo" (5 GB — newer, better quality).
  3. For SadTalker: the ComfyUI workflow takes (a) a portrait photo and (b) an audio file (WAV, 3-10 seconds of speech). A scripted equivalent of this step is sketched after the list.
    • The model generates a talking-head video: face moves with the audio, lips sync, natural head motion.
    • Resolution: typically 256×256 (face crop). Upscale to 512×512 with GFPGAN/CodeFormer (install via ComfyUI Manager).
  4. First talking-head video in 30-90 seconds for a 5-second clip on an 8+ GB GPU.
  5. For Hallo (better quality, more natural motion):
    • git clone https://github.com/fudan-generative-vision/hallo → follow setup instructions
    • Produces 512×512 talking heads with natural eye blinking, head movement, and expression variation
    • 1-3 minutes per 5-second clip on 12+ GB GPU
  6. Use cases: AI presenters, virtual assistants, character dialogue, video messages without filming.
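
If you'd rather script SadTalker than drive it through a ComfyUI graph, the repo ships an inference.py entry point. A minimal wrapper, assuming a local SadTalker checkout; the flag names (--source_image, --driven_audio, --result_dir, --enhancer) follow the SadTalker README, so verify them against your version:

    import subprocess
    from pathlib import Path

    SADTALKER_DIR = Path("~/SadTalker").expanduser()  # assumed checkout location

    def render_clip(portrait: Path, audio: Path, out_dir: Path) -> None:
        """Render one talking-head clip from a portrait photo and a WAV file."""
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [
                "python", "inference.py",
                "--source_image", str(portrait),
                "--driven_audio", str(audio),
                "--result_dir", str(out_dir),
                "--enhancer", "gfpgan",  # GFPGAN face upscaling, as in step 3
            ],
            cwd=SADTALKER_DIR,
            check=True,
        )

    if __name__ == "__main__":
        render_clip(Path("portrait.png"), Path("line1.wav"), Path("renders"))

The same wrapper is reused for the batch pattern sketched in the serious-setup section below.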

The cheap setup

Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs SadTalker at 30-60 seconds per 5-second clip and Hallo at 1-3 minutes per clip. For a 1-minute avatar video, expect ~10-30 minutes of generation. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1TB NVMe. Total: ~$390-440. Avatar generation is a moderate compute load — 10-50× faster than text-to-video. For simple talking heads (SadTalker), 8 GB cards handle it comfortably; Hallo benefits from 12+ GB for higher-resolution output.
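
The 1-minute figure above is just the per-clip timing multiplied out. A quick estimator, using the ranges quoted in this section rather than measured numbers:

    def render_minutes(video_seconds: float, clip_seconds: tuple[float, float] = (30, 60)) -> tuple[float, float]:
        """Estimate total GPU minutes for a talking-head video from per-5-second-clip timings."""
        clips = video_seconds / 5
        low, high = clip_seconds
        return clips * low / 60, clips * high / 60

    print(render_minutes(60))             # SadTalker on the 3060: ~(6.0, 12.0) minutes
    print(render_minutes(60, (60, 180)))  # Hallo on the 3060: ~(12.0, 36.0) minutes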

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs Hallo at 30-60 seconds per 5-second clip — near-real-time avatar generation. Can produce 10 minutes of avatar video in ~1-2 hours. For production avatar pipelines (AI news anchors, virtual customer service agents), batch generation overnight handles daily content needs. Total: ~$1,800-2,200. Avatar generation is not the bottleneck — audio recording and script writing take more time than GPU rendering. A single RTX 3060 handles most production avatar workloads.
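
A sketch of the overnight-batch pattern: walk a folder of portrait/audio pairs and render them one at a time with the render_clip wrapper from the setup walkthrough. The jobs/<name>.png + jobs/<name>.wav layout and the module name are assumptions for this example, not conventions of any tool:

    from pathlib import Path

    from render_sadtalker import render_clip  # hypothetical module holding the earlier wrapper

    def run_batch(job_dir: Path, out_root: Path) -> None:
        """Render every portrait/audio pair in job_dir, skipping clips already rendered."""
        for portrait in sorted(job_dir.glob("*.png")):
            audio = portrait.with_suffix(".wav")
            if not audio.exists():
                print(f"skipping {portrait.name}: no matching WAV")
                continue
            out_dir = out_root / portrait.stem
            if out_dir.exists():  # crude resume support for overnight runs
                continue
            render_clip(portrait, audio, out_dir)

    if __name__ == "__main__":
        run_batch(Path("jobs"), Path("renders"))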

Common beginner mistake

The mistake: Using a low-resolution webcam selfie as the reference photo, then wondering why the avatar looks pixelated and unnatural.

Why it fails: SadTalker and Hallo work at fixed resolutions (256×256 or 512×512). A compressed, noisy 480p webcam image has already lost the facial detail. When the model animates it, every artifact animates too — JPEG compression blocks dance across the face.

The fix: Use a high-quality portrait photo: well-lit (soft diffuse light, no harsh shadows), neutral expression, looking at the camera, 1024×1024+ resolution, sharp focus on the eyes. The reference photo quality IS the avatar quality. For professional avatars, take the photo with a decent camera (phone in portrait mode, or a mirrorless with an 85mm lens) in controlled lighting. Garbage portrait → garbage avatar. Good portrait → convincing avatar.
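
A preflight check in that spirit: reject portraits that are too small or too soft before spending GPU time on them. The thresholds are illustrative, and the sharpness test (variance of the Laplacian via OpenCV) is a generic heuristic, not something SadTalker or Hallo require:

    import cv2  # pip install opencv-python

    MIN_SIDE = 1024        # this guide's suggestion: 1024x1024+ source portrait
    MIN_SHARPNESS = 100.0  # illustrative Laplacian-variance cutoff; tune on your own photos

    def check_portrait(path: str) -> list[str]:
        """Return a list of problems with a reference portrait (empty list = looks usable)."""
        img = cv2.imread(path)
        if img is None:
            return [f"could not read {path}"]
        problems = []
        h, w = img.shape[:2]
        if min(h, w) < MIN_SIDE:
            problems.append(f"resolution {w}x{h} is below {MIN_SIDE}px on the short side")
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
        if sharpness < MIN_SHARPNESS:
            problems.append(f"image looks soft (Laplacian variance {sharpness:.0f})")
        return problems

    print(check_portrait("portrait.png"))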

Reality check

Local text-to-video (LTX-Video, Mochi) is genuinely possible in 2026 but VRAM-hungry: 24 GB is the working minimum and 32 GB the comfort zone for long-form workflows. Below 24 GB, general video gen isn't realistic with current models. Audio-driven avatar models are the exception, which is why SadTalker and Hallo, running in 8-12 GB, are the practical entry point.

Common mistakes

  • Trying video gen on 16 GB cards (the model weights plus activations don't fit)
  • Underestimating runtime VRAM (peak usage can reach ~1.5× model size on long sequences; see the headroom sketch after this list)
  • Mixing video gen with concurrent LLM serving on the same GPU
  • Using Apple Silicon for video gen — it works, but is typically 30-50% slower than comparable CUDA hardware
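
To sanity-check the 1.5× rule against your card before queuing a long job, a minimal headroom estimate (the multiplier is the rule of thumb from the list above, not a measured constant):

    import torch

    def vram_headroom_gb(model_size_gb: float, peak_multiplier: float = 1.5) -> float:
        """Free VRAM left after the estimated peak working set of a video model."""
        if not torch.cuda.is_available():
            raise RuntimeError("no CUDA device visible")
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        return total_gb - model_size_gb * peak_multiplier

    # Example: the ~5 GB "hallo" download from the setup steps
    print(f"headroom: {vram_headroom_gb(5.0):.1f} GB")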

What breaks first

The errors most operators hit when running avatar generation locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle avatar generation before committing money.
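
If you already have a card and want to know where it lands, a minimal check against the VRAM tiers used on this page (the cutoffs are this guide's rules of thumb, not hard limits):

    import torch

    TIERS = [
        (24, "full text-to-video (LTX-Video, Mochi)"),
        (12, "Hallo at 512x512"),
        (8,  "SadTalker talking heads"),
    ]

    def report() -> None:
        if not torch.cuda.is_available():
            print("no CUDA GPU detected; avatar generation would be CPU-only and very slow")
            return
        props = torch.cuda.get_device_properties(0)
        vram_gb = props.total_memory / 1024**3
        print(f"{props.name}: {vram_gb:.0f} GB VRAM")
        for min_gb, workload in TIERS:
            status = "ok" if vram_gb >= min_gb else "not enough VRAM"
            print(f"  {workload} (needs ~{min_gb} GB): {status}")

    report()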

Specialized buyer guides