Motion Transfer
Applying motion from a source video to a target subject — pose-driven dance generation, lip-sync, gesture transfer.
Setup walkthrough
- Install ComfyUI via Stability Matrix.
- ComfyUI Manager → Install Models → search "animate-diff" or "mimicmotion" for pose-driven animation.
- For pose-driven motion transfer (dance, gestures):
Approach 1 — AnimateDiff + ControlNet:
- Load a reference image (the character you want to animate)
- Load a driving video (the motion source — could be a dance video)
- Extract the pose from the driving video with DWPose (ComfyUI node); a standalone sketch of this step follows the list
- Feed pose sequence to AnimateDiff + ControlNet → character moves with the driving pose
- 16 frames (~1 second) renders in 30-60 seconds on a 12 GB GPU
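If you want to see what the DWPose step produces outside of ComfyUI, here is a minimal Python sketch. It assumes the controlnet_aux package exposes a DWposeDetector class (constructor arguments and weight handling vary by version, so treat that part as an assumption); inside ComfyUI the DWPose node does all of this for you.

```python
# Sketch: extract a per-frame pose sequence from a driving video.
# Assumption: your controlnet_aux version ships DWposeDetector and downloads
# default weights on first use. In ComfyUI the DWPose node performs this step;
# this is only to show the data flow (video frames in, skeleton images out).
import cv2
from PIL import Image
from controlnet_aux import DWposeDetector  # assumption: available in your install

detector = DWposeDetector()  # assumption: default constructor works in your version

def pose_frames(video_path: str, max_frames: int = 16):
    """Yield DWPose skeleton images for the first `max_frames` frames."""
    cap = cv2.VideoCapture(video_path)
    grabbed = 0
    while grabbed < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        # OpenCV returns BGR arrays; the detector expects an RGB PIL image.
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        yield detector(Image.fromarray(rgb))
        grabbed += 1
    cap.release()

# Each pose image conditions one AnimateDiff frame via the openpose ControlNet.
if __name__ == "__main__":
    for i, pose in enumerate(pose_frames("dance_source.mp4")):
        pose.save(f"pose_{i:03d}.png")
```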
Approach 2 — MimicMotion (more recent, better quality):
- Install: pip install mimicmotion
- Feed a reference image + driving video → outputs the animated character
- Better temporal consistency than AnimateDiff
- Expect your first motion-transferred clip within minutes: budget roughly 1-5 minutes of render time per second of output.
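MimicMotion's inference entry point differs between the pip package, the GitHub repo, and ComfyUI wrapper nodes, so the sketch below only illustrates the input/output contract described above (one reference image plus one driving video per output clip). run_mimicmotion is a hypothetical placeholder for whichever entry point your install actually provides.

```python
# Sketch of the MimicMotion input/output contract: reference image + driving
# video in, animated clip out. `run_mimicmotion` is a hypothetical stand-in;
# substitute your install's inference script or node call.
from pathlib import Path

def run_mimicmotion(reference_image: Path, driving_video: Path, output_path: Path) -> None:
    raise NotImplementedError("call your MimicMotion inference entry point here")

reference = Path("character.png")                       # the subject to animate
driving_clips = sorted(Path("driving").glob("*.mp4"))   # motion sources

for clip in driving_clips:
    out = Path("outputs") / f"{clip.stem}_animated.mp4"
    out.parent.mkdir(parents=True, exist_ok=True)
    run_mimicmotion(reference, clip, out)
    print(f"wrote {out}")  # plan for roughly 1-5 minutes per second of output
```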
The cheap setup
Used RTX 3060 12 GB ($200-250, see /hardware/rtx-3060-12gb). Runs AnimateDiff + ControlNet (pose) at 30-60 seconds per 16-frame clip (1 second of animation at 16 fps). For a 3-second dance clip: ~2-3 minutes. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1 TB NVMe. Total: ~$390-440. Motion transfer is lighter than full video generation: AnimateDiff is an SD 1.5-based method and only needs 4-6 GB of VRAM. At ~$400 you can transfer motion to characters reliably, with a short clip rendering in a couple of minutes.
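Those timings are just arithmetic, so they can be sanity-checked in a few lines. The snippet below uses only the figures already quoted for the cheap setup (16-frame clips, 16 fps output, 30-60 seconds per clip); it is a planning estimate, not a benchmark.

```python
# Back-of-envelope render time for AnimateDiff motion transfer on the cheap
# setup: 16-frame clips, 16 fps output, 30-60 s per clip on an RTX 3060 12 GB.
FRAMES_PER_CLIP = 16
OUTPUT_FPS = 16
SECONDS_PER_CLIP_LOW, SECONDS_PER_CLIP_HIGH = 30, 60

def estimated_render_time(animation_seconds: float) -> tuple[float, float]:
    """Return (low, high) wall-clock seconds to render `animation_seconds` of output."""
    total_frames = animation_seconds * OUTPUT_FPS
    clips = total_frames / FRAMES_PER_CLIP          # number of 16-frame batches
    return clips * SECONDS_PER_CLIP_LOW, clips * SECONDS_PER_CLIP_HIGH

low, high = estimated_render_time(3.0)              # the 3-second dance clip example
# Prints ~1.5-3.0 minutes; model loading and video decode overhead push the
# practical figure toward the ~2-3 minutes quoted above.
print(f"3 s of animation: {low / 60:.1f}-{high / 60:.1f} minutes of rendering")
```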
The serious setup
Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Runs AnimateDiff + multiple ControlNets (pose + depth) at 20-40 seconds per 16-frame clip. For longer sequences (5-10 seconds), the extra VRAM prevents the temporal artifacts that occur when AnimateDiff runs out of context memory. For SDXL-based animation (AnimateDiff-XL): 24 GB handles 32-frame sequences smoothly. Total: ~$1,800-2,200. Motion transfer is significantly lighter than text-to-video generation — even 8 GB cards handle basic workflows.
Common beginner mistake
The mistake: Using a driving video of a professional dancer doing complex full-body spins and jumps and expecting a static portrait photo to animate with the same motion.
Why it fails: The motion model sees a source video with extreme pose changes (arms behind the back, crouching, jumping) and a reference image in a neutral standing pose. The model can't map extreme poses onto an image where the corresponding body parts aren't visible, so when the legs disappear behind the body, it hallucinates limbs.
The fix: Match the driving video to the reference image. If your reference is a standing portrait, use a driving video of someone nodding, talking, or making small gestures. If you need complex dance motion, the reference image should show the full body in a neutral dance pose. The model maps pose to pose: if the driving pose puts limbs in positions not visible in the reference, you get artifacts. Garbage in, garbage out applies doubly to motion transfer.
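One way to catch the mismatch before rendering is to compare which joints are visible in the reference image versus the driving frames. The sketch below is hypothetical: detect_keypoints stands in for whatever pose estimator you use (for example, DWPose output parsed into named joints with confidences), and the 0.3 visibility threshold is an arbitrary assumption.

```python
# Hypothetical sanity check for the "match the driving video to the reference"
# rule: flag driving frames whose poses need body parts the reference never shows.
# `detect_keypoints` is a stand-in, assumed to return {joint_name: confidence}.

def detect_keypoints(image_path: str) -> dict[str, float]:
    raise NotImplementedError("plug in DWPose / OpenPose keypoint extraction here")

CONFIDENCE_THRESHOLD = 0.3  # assumption: below this, treat the joint as not visible

def missing_joints(reference_image: str, driving_frame: str) -> set[str]:
    """Joints the driving pose relies on but the reference image never shows."""
    ref = detect_keypoints(reference_image)
    drv = detect_keypoints(driving_frame)
    ref_visible = {j for j, c in ref.items() if c >= CONFIDENCE_THRESHOLD}
    drv_visible = {j for j, c in drv.items() if c >= CONFIDENCE_THRESHOLD}
    return drv_visible - ref_visible

# Any non-empty result (e.g. {"left_knee", "right_ankle"} for a standing portrait
# driven by a dance video) predicts hallucinated limbs in the output.
```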
Recommended setup for motion transfer
Browse all tools for runtimes that fit this workload.
Reality check
Full local video generation (LTX-Video, Mochi) is genuinely possible in 2026 but VRAM-hungry: 24 GB is the working minimum, 32 GB is the comfort zone for long-form workflows, and below 24 GB it isn't realistic with current models. Motion transfer is the exception: because the AnimateDiff pipeline builds on SD 1.5, the 12 GB and even 8 GB setups described above are enough.
Common mistakes
- Trying full text-to-video on 16 GB cards (model weights plus activations don't fit)
- Underestimating runtime VRAM (peak draw is ~1.5x model size on long sequences; a quick planning check follows this list)
- Mixing video generation with concurrent LLM serving on the same GPU
- Using Apple Silicon for video generation (viable, but 30-50% slower than CUDA)
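The 1.5x rule of thumb in the second bullet turns into a quick planning check. The model weight footprints below are rough assumptions, not measured values, and actual usage still depends on resolution, frame count, and attention implementation.

```python
# Rule-of-thumb VRAM planner: budget ~1.5x the model's weight footprint for
# peak draw on long sequences. Footprints are assumptions for illustration.
MODEL_SIZES_GB = {
    "animatediff-sd15": 4.0,    # assumption: SD 1.5 UNet + motion module, fp16
    "animatediff-sdxl": 10.0,   # assumption: SDXL-based animation stack, fp16
}

PEAK_FACTOR = 1.5               # peak draw ~1.5x model size on long sequences

def fits(model: str, vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """True if the model's estimated peak draw fits in `vram_gb` with headroom."""
    peak = MODEL_SIZES_GB[model] * PEAK_FACTOR
    return peak + headroom_gb <= vram_gb

for card, vram in [("RTX 3060", 12), ("RTX 3090", 24)]:
    for model in MODEL_SIZES_GB:
        print(f"{card:<9} {model:<18} fits: {fits(model, vram)}")
```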
What breaks first
The errors most operators hit when running motion transfer locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle motion transfer before committing money.