
Image Editing

Modifying existing images via prompts or masks. Distinct from generation — specialized editing models and conditioning mechanisms (Flux Fill, ControlNet, IP-Adapter) excel here.

Capability notes

Open-weight image editing in 2026 uses three mechanisms: **inpainting** (regenerating masked regions), **ControlNet-guided generation** (structural conditioning via edge maps, depth maps, pose skeletons), and **IP-Adapter** (style/subject injection from reference images). Combined, they form composite editing pipelines.

**Flux Fill** (Black Forest Labs) leads inpainting/outpainting quality. On the MagicBrush benchmark, Flux Fill scores 85-90 vs SDXL-inpainting's 72-78, and it produces fewer boundary artifacts (visible seams) than SDXL, particularly on complex textures (hair, foliage, text). The model was trained specifically on masked-image completion, not retrofitted from text-to-image.

**ControlNet** conditions generation on structural inputs: Canny edges preserve outlines while restyling interiors, depth maps preserve spatial layout for scene editing, and OpenPose preserves human pose for character design. **IP-Adapter** injects visual style from a reference image via cross-attention — one product reference applied to multiple scenes maintains brand consistency.

**What open-weight editing cannot do well in mid-2026**: (1) non-destructive layer-based editing like Photoshop — diffusion models produce pixels, not editable layers; (2) text modification in images — changing "30% off" to "50% off" typically corrupts the rendering; (3) multi-shot consistent editing across perspectives — exact identity preservation across frames is video-editing territory; (4) resolution independence — going beyond 4K without tiling artifacts requires 24 GB+ VRAM.

**The ComfyUI ecosystem** is the practical standard. Pre-built workflows on Civitai provide drag-and-drop pipelines combining Flux Fill + ControlNet + IP-Adapter. [Automatic1111](/tools/automatic1111) supports inpainting via extensions but handles multi-model chaining less flexibly. [Diffusers](/tools/diffusers) provides programmatic access — the choice for application integration.
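The ControlNet mechanism is the easiest to see in code. Below is a minimal [Diffusers](/tools/diffusers) sketch of Canny-edge conditioning: extract an edge map from the source, then generate against it so outlines survive while the prompt restyles the interior. The model IDs are the public SDXL ones; file names and the conditioning scale are illustrative assumptions.

```python
# A minimal sketch of ControlNet-guided restyling with Diffusers.
# Assumptions: public SDXL + Canny ControlNet checkpoints, a CUDA GPU,
# and illustrative file names ("room.png").
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from PIL import Image

source = Image.open("room.png").convert("RGB")

# Canny keeps outlines and discards interior detail, so structure is
# preserved while the prompt controls materials, color, and lighting.
edges = cv2.Canny(np.array(source), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(
    prompt="scandinavian interior, light wood, soft morning light",
    image=control_image,
    controlnet_conditioning_scale=0.8,  # how strongly edges constrain output
).images[0]
result.save("restyled.png")
```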

If you just want to try this

Lowest-friction path to a working setup.

Download [ComfyUI](https://github.com/comfyanonymous/ComfyUI) — a node-graph interface with a steeper learning curve than [Automatic1111](/tools/automatic1111), but the only tool supporting Flux Fill + ControlNet + IP-Adapter without fighting extensions. Budget 30-60 minutes to get oriented. Download a pre-built Flux Fill inpainting workflow from Civitai (search "Flux Fill inpainting ComfyUI workflow"), import the JSON into ComfyUI, and load the Flux Dev Fill checkpoint from Hugging Face or Civitai. You now have working inpainting: upload an image, draw a mask over the region to edit, type a prompt, generate. First generation: 30-90 seconds on an [RTX 4070](/hardware/rtx-4070); subsequent runs: 15-45 seconds.

Inpainting prompt format: "a [object] in [style], matching the lighting and perspective of the surrounding scene, photorealistic, seamless integration." The lighting/perspective clauses reduce boundary artifacts by guiding the model to align with the existing context.

Minimum hardware: 12 GB VRAM ([RTX 3060 12GB](/hardware/rtx-3060-12gb), [RTX 4070](/hardware/rtx-4070)) for Flux Dev Fill at 1024×1024 (45-90s). 16 GB ([RTX 5070 Ti](/hardware/rtx-5070-ti)) for comfortable 1024×1024 with ControlNet headroom. 24 GB ([RTX 4090](/hardware/rtx-4090)) for 1536×1536 or combined Flux Fill + ControlNet + IP-Adapter. A quick VRAM check is sketched below.

Don't learn ControlNet and IP-Adapter on day one. Start with basic inpainting → understand mask quality (crisp edges, feathering) → add ControlNet for structural conditioning → add IP-Adapter for style injection.
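Before pulling ~23 GB of weights, it's worth confirming your card clears the 12 GB floor. A quick sanity check in plain PyTorch; the thresholds simply mirror the tiers quoted above.

```python
# Check available VRAM against the minimums quoted above (PyTorch only).
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU detected; this path assumes an NVIDIA card.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")

if vram_gb < 12:
    print("Below the 12 GB minimum: expect OOM errors or heavy offloading.")
elif vram_gb < 16:
    print("Workable for Flux Dev Fill at 1024x1024; skip extra ControlNets.")
else:
    print("Headroom for ControlNet and IP-Adapter on top of Flux Fill.")
```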

For production deployment

Operator-grade recommendation.

Production image editing requires solving batch throughput, quality consistency, and integration.

**Batch editing pipeline**: Wrap ComfyUI in API mode (it exposes a `/prompt` endpoint for workflow execution). Submit workflow JSON + image + mask + prompt via REST → receive the output image. Queue: Redis-backed with priority (real-time before batch). Run multiple ComfyUI workers behind a load balancer, each pinned to a dedicated GPU. A submission sketch follows below.

**Quality consistency**: Diffusion models are stochastic — the same mask+prompt produces different results across seeds. Options: (1) fixed seed per job (deterministic, zero overhead); (2) generate 3-5 candidates per edit and auto-select the best via CLIP-IQA or an aesthetic scorer; (3) human-in-the-loop for high-stakes edits (product photos) with a review queue.

**Throughput economics**: Flux Fill on an [RTX 4090](/hardware/rtx-4090) produces ~30-60 1024×1024 inpaintings/hour (60-120s each). For 1,000 product photo edits: 17-34 GPU-hours. Cloud rental at $2-4/hour = $34-136 per 1,000 edits, vs manual Photoshop at $5-20/edit ($5,000-20,000 per 1,000). 4× [RTX 4090](/hardware/rtx-4090) in parallel: 120-240 edits/hour.

**Integration patterns**: For e-commerce: product photo → SAM (Segment Anything Model) for automatic mask generation → ComfyUI inpainting with "product on white background, studio lighting" → output. For content moderation: flagged image → detection-model mask → inpaint with "empty space, matching surroundings" → review before publishing.

**IP-Adapter for brand consistency**: Feed a brand style reference image → all edits maintain brand colors, typography feel, and photographic style. This is the closest open-weight equivalent to Adobe's "Generative Fill with Style Reference."

**VRAM planning**: Flux Dev Fill (FP16) baseline = ~12 GB. Each ControlNet adds 2-3 GB; IP-Adapter adds 1-2 GB; a 1024×1024 activation buffer adds 4-6 GB. Total for Fill + 1 ControlNet + IP-Adapter = 19-23 GB — fits a 24 GB card, tightly. Fill + 2 ControlNets + IP-Adapter = 21-26 GB — needs a 32 GB card or FP8 quantization.
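A sketch of the batch-submission pattern, assuming a headless ComfyUI worker on its default port and a workflow exported via "Save (API Format)". The node IDs ("3", "6") and file names are hypothetical; real IDs come from your exported JSON. Production code would use the queue and websocket events rather than polling.

```python
# Submit one edit job to a headless ComfyUI worker via its /prompt endpoint,
# then poll /history for completion. Node IDs and file names are assumptions.
import json
import time

import requests

COMFY = "http://127.0.0.1:8188"

with open("flux_fill_workflow_api.json") as f:  # exported via "Save (API Format)"
    workflow = json.load(f)

# Fix the seed for deterministic re-runs; "3" and "6" are hypothetical node IDs
# for the sampler and the positive-prompt text encoder in this workflow.
workflow["3"]["inputs"]["seed"] = 42
workflow["6"]["inputs"]["text"] = "a leather armchair, matching the lighting"

prompt_id = requests.post(f"{COMFY}/prompt", json={"prompt": workflow}).json()["prompt_id"]

# Poll until the job appears in history (use websockets or a queue at scale).
while True:
    history = requests.get(f"{COMFY}/history/{prompt_id}").json()
    if prompt_id in history:
        print("Outputs:", list(history[prompt_id]["outputs"].keys()))
        break
    time.sleep(2)
```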

What breaks

Failure modes operators see in the wild.

- **Inpaint boundary artifacts (visible seams).** The generated region doesn't blend — a visible edge with color/texture/lighting mismatch. Mitigation: feather mask edges (5-15px Gaussian blur), use mask dilation (2-5px), apply Poisson blending post-generation (a mask-preparation sketch follows this list). For extreme cases, run a second inpainting pass on the boundary region.
- **ControlNet conditioning failure on extreme poses.** A pose or structure outside the training distribution — the model ignores the conditioning and generates plausible but unconstrained output. Mitigation: pre-validate ControlNet inputs (flag physically impossible joint angles). Combine multiple ControlNets (pose + depth) to reinforce conditioning.
- **IP-Adapter style bleed.** Reference-image content leaks into the generation — a red car reference produces car-shaped objects in all outputs. Mitigation: reduce IP-Adapter weight (0.3-0.6, not 0.7-1.0). Use a CLIP-based style-content separation preprocessor that extracts style while suppressing content.
- **Resolution mismatch between mask and generation.** Mask prepared at 512×512, generation at 1024×1024 — resizing produces aliased, jagged boundaries. Mitigation: prepare masks at the exact generation resolution. ComfyUI native nodes handle this; manual Photoshop/GIMP imports must match.
- **Color/lighting inconsistency.** The inpainted region has a different white balance, color temperature, or light direction — it looks composited. Mitigation: include a lighting description in the prompt. Post-process with color histogram matching between the surrounding and generated regions. Flux Fill's built-in context-aware conditioning improves this natively.
- **Mask quality is output quality.** Poor masks cascade into poor results regardless of model quality. Mitigation: use SAM for automatic masks, then manually refine edges. Budget 10-30 seconds of mask refinement per image for production quality.
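The dilation and feathering mitigations above are a few lines of Pillow. A minimal sketch, assuming a white-on-black mask file; the kernel and blur sizes map to the pixel ranges quoted in the bullets.

```python
# Dilate the mask slightly, then feather the edge so the sampler gets a soft
# transition band instead of a hard seam. Pillow only; file names illustrative.
from PIL import Image, ImageFilter

mask = Image.open("mask.png").convert("L")  # white = region to regenerate

# MaxFilter(5) grows the white region by ~2 px (the 2-5 px dilation above).
dilated = mask.filter(ImageFilter.MaxFilter(5))

# An 8 px Gaussian blur sits inside the 5-15 px feathering range above.
feathered = dilated.filter(ImageFilter.GaussianBlur(8))
feathered.save("mask_feathered.png")
```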

Hardware guidance

**Hobbyist ($600-$1,500)**: [RTX 3060 12GB](/hardware/rtx-3060-12gb) or [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb). 12 GB is the absolute minimum for Flux Dev Fill at 1024×1024 (60-90s); 16 GB is comfortable with one ControlNet or IP-Adapter at 1024×1024. The [AMD RX 7800 XT](/hardware/rx-7800-xt) 16 GB works on Linux via ROCm — ComfyUI has reasonable ROCm support with 10-20% more configuration friction vs NVIDIA. The [Apple M4 Pro 24GB](/hardware/apple-m4-pro) runs Flux Dev via an MLX-based Diffusers port — functional but 2-4× slower than an equivalent NVIDIA GPU.

**SMB ($2,000-$4,000)**: [RTX 4090 24GB](/hardware/rtx-4090) or [RTX 5090 32GB](/hardware/rtx-5090). The 4090 runs Flux Fill + 1 ControlNet + IP-Adapter at 1024×1024 in 30-60s. The 5090's 32 GB handles Flux Fill + 2 ControlNets + IP-Adapter at 1536×1536 — the single-card sweet spot for a professional editing workstation.

**Enterprise ($8,000-$25,000)**: [RTX A6000](/hardware/rtx-a6000) 48 GB or [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for a batch editing server. 2-3 concurrent ComfyUI workers per card; 4× [L40S](/hardware/nvidia-l40s) supports 8-12 concurrent editing sessions. This tier suits e-commerce operations processing 10,000+ product photos daily.

**Apple Silicon**: [Mac Studio M3 Ultra 128GB](/hardware/mac-studio-m3-ultra) runs Flux Fill + ControlNet on MLX — slower (2-4×), but unified memory enables larger resolutions and more simultaneous ControlNets than any single NVIDIA consumer card. Flux Schnell Fill (distilled: faster, lower quality) reduces the VRAM baseline from 12 GB to ~7 GB.

Runtime guidance

**If you need a GUI for interactive editing** → [ComfyUI](https://github.com/comfyanonymous/ComfyUI). The canonical tool for open-weight image editing — supports Flux Fill, SDXL inpainting, ControlNet (all variants), IP-Adapter, and multi-model workflow chaining. Download pre-built workflows from Civitai — don't build from scratch.

**If you want simpler, basic inpainting only** → [Automatic1111](/tools/automatic1111) Inpaint tab. Supports SDXL and SD 1.5 inpainting; Flux Fill requires extensions. Simpler learning curve but less flexible. Choose A1111 for basic inpainting; ComfyUI for ControlNet + IP-Adapter + multi-model workflows.

**If building a programmatic pipeline** → [Hugging Face Diffusers](/tools/diffusers). Supports Flux Fill (`FluxFillPipeline`), ControlNet, and IP-Adapter with a consistent API. Python script: load the pipeline, accept image + mask + prompt, return the edited image. Deploy behind FastAPI. Diffusers handles VRAM management better than raw model loading (see the sketch below).

**If you need automatic masking** → SAM (Segment Anything Model) via Hugging Face or ComfyUI SAM nodes. Automatic mask generation produces masks for every object without user input — filter by size/position. SAM adds 1-2 GB VRAM; run it on CPU if GPU memory is tight.

**If on macOS** → ComfyUI with the MPS (Metal Performance Shaders) backend, 2-4× slower than CUDA. [Draw Things](https://drawthings.ai) (a Mac-native Stable Diffusion app) covers basic inpainting with a simpler UI; use ComfyUI with MPS for ControlNet/IP-Adapter pipelines.

**If batch processing at scale** → ComfyUI headless (API server) + a job queue. Deploy with the `--listen` flag. Run multiple instances behind nginx, each pinned to a GPU via `CUDA_VISIBLE_DEVICES`. Cloud: RunPod/Vast.ai/Lambda GPU instances with a ComfyUI Docker image.
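The Diffusers path in miniature: a hedged sketch using `FluxFillPipeline` as documented upstream. The FLUX.1-Fill-dev weights are gated on Hugging Face, and the file names here are illustrative. Wrap the call in a FastAPI handler for the deployment pattern above.

```python
# Minimal programmatic inpainting with diffusers' FluxFillPipeline.
# Assumes accepted access to black-forest-labs/FLUX.1-Fill-dev.
import torch
from diffusers import FluxFillPipeline
from PIL import Image

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for VRAM on smaller cards

image = Image.open("product.png").convert("RGB")
mask = Image.open("mask.png").convert("L")  # white = regenerate

edited = pipe(
    prompt="product on white background, studio lighting, seamless integration",
    image=image,
    mask_image=mask,
    height=1024,
    width=1024,
    guidance_scale=30.0,  # Flux Fill runs at much higher guidance than SDXL
    num_inference_steps=50,
).images[0]
edited.save("edited.png")
```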

Setup walkthrough

  1. Install ComfyUI via Stability Matrix (stabilitymatrix.com → download → one-click).
  2. ComfyUI Manager → Install Models → search "flux1-fill-dev" → download (~23 GB).
  3. Load Flux Fill workflow (ComfyUI workflow library or comfyanonymous.github.io examples).
  4. In the workflow: (a) load your source image into the Load Image node, (b) attach a mask and paint the region to edit, (c) type a prompt: "Replace the background with a sunset beach," (d) set steps=20, guidance=3.5.
  5. Queue → first edited image in 10-20 seconds on a 24 GB GPU.
  6. For lighter editing (instruct-based): use SDXL + InstructPix2Pix (~6 GB total). Prompt: "Make it night time." No mask needed — the model follows instructions.
  7. For face/character editing: add IP-Adapter (ComfyUI Manager → install IP-Adapter) to preserve identity.
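Step 7 has a Diffusers equivalent if you'd rather script it. A sketch, assuming the public SDXL base checkpoint and the h94/IP-Adapter weights; the 0.5 scale sits in the low range that limits style bleed (see "What breaks").

```python
# Inject identity/style from a reference image via IP-Adapter in Diffusers.
# Model and adapter IDs are the public SDXL ones; file names illustrative.
import torch
from diffusers import AutoPipelineForText2Image
from PIL import Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.5)  # 0.3-0.6 keeps style without content bleed

reference = Image.open("character_reference.png").convert("RGB")
out = pipe(
    prompt="the same character reading in a rainy cafe",
    ip_adapter_image=reference,
).images[0]
out.save("consistent_character.png")
```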

The cheap setup

Used [RTX 3060 12GB](/hardware/rtx-3060-12gb) ($200-250). Runs SDXL + ControlNet + IP-Adapter for instruction-based editing at 10-20 seconds per image. Flux Fill quantized to fit 12 GB (FP8, or GGUF Q8) runs at 20-35 seconds per 1024×1024 edit. Pair with a Ryzen 5 5600 + 32 GB DDR4 + 1TB NVMe. Total: ~$390-440. For simple edits (color adjustments, background swaps), 12 GB handles everything; complex multi-ControlNet workflows benefit from 16+ GB.
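What the quantized path looks like in code: a sketch using diffusers' GGUF support to load a quantized Fill transformer so the pipeline fits a 12 GB card. The local .gguf filename is an assumption; community quants of FLUX.1-Fill-dev circulate on Hugging Face.

```python
# Load a GGUF-quantized Flux Fill transformer so the model fits a 12 GB card.
# The .gguf file is assumed to be downloaded locally; repo names vary.
import torch
from diffusers import FluxFillPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

transformer = FluxTransformer2DModel.from_single_file(
    "flux1-fill-dev-Q8_0.gguf",  # hypothetical local path to a community quant
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # still needed at 12 GB even with Q8 weights
```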

The serious setup

Used [RTX 3090 24GB](/hardware/rtx-3090) ($700-900). Runs Flux Fill Dev at 8-15 seconds per 1024×1024 edit (FP16 fully in VRAM). Can combine Flux Fill + ControlNet + IP-Adapter simultaneously for precise edits while preserving identity and structure. SDXL editing workflows run at 3-5 seconds per image. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Total: ~$1,800-2,200. A used [RTX 4090 24GB](/hardware/rtx-4090) (~$1,600) drops Flux Fill to 3-6 seconds — roughly 2.5× faster.

Common beginner mistake

The mistake: using basic img2img (denoising the whole image) to make a specific edit like "change the shirt color to blue" and getting an entirely different image. Why it fails: img2img with high denoising adds random noise across the entire image — it changes everything, not just the shirt. The fix: use inpainting with a mask. Paint a mask only over the shirt, then prompt "blue shirt." The model only generates within the masked region. For instruction-based editing without masks, use InstructPix2Pix-style models — they are trained to follow instructions while preserving unreferenced regions. Mask-based editing is the foundation; instruction-based editing is the convenience layer on top.
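The mistake and the fix, side by side in Diffusers. A hedged sketch with the public SDXL checkpoint; file names are illustrative, and loading two pipelines at once is for contrast, not efficiency.

```python
# img2img re-noises the whole frame; inpainting regenerates only the mask.
import torch
from diffusers import AutoPipelineForImage2Image, AutoPipelineForInpainting
from PIL import Image

model = "stabilityai/stable-diffusion-xl-base-1.0"
image = Image.open("portrait.png").convert("RGB")

# The mistake: whole-image img2img. strength=0.8 perturbs everything, so
# face, background, and pose drift along with the shirt.
img2img = AutoPipelineForImage2Image.from_pretrained(
    model, torch_dtype=torch.float16
).to("cuda")
drifted = img2img(prompt="blue shirt", image=image, strength=0.8).images[0]

# The fix: inpaint with a mask. Only the white (masked) shirt region is
# regenerated; pixels outside the mask are preserved.
shirt_mask = Image.open("shirt_mask.png").convert("L")
inpaint = AutoPipelineForInpainting.from_pretrained(
    model, torch_dtype=torch.float16
).to("cuda")
fixed = inpaint(
    prompt="blue shirt", image=image, mask_image=shirt_mask
).images[0]
fixed.save("blue_shirt.png")
```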

Reality check

Image gen is compute-bound, not bandwidth-bound. VRAM matters for the resolution + LoRA training stack, but FP16 TFLOPS is what decides Flux throughput. The 5080's compute advantage over the 5070 Ti shows up here in ways it doesn't on LLM inference.

Common mistakes

  • Buying for the VRAM ceiling without checking compute (Flux Dev FP16 doesn't fit in 16 GB anyway)
  • Skipping LoRA training requirements (24 GB minimum, 32 GB comfortable for Flux)
  • Underestimating ComfyUI's multi-model VRAM appetite vs A1111's single-pipeline design
  • Using Q4-quantized image models — the quality drop is more visible than on LLMs


