Text-to-Image Generation

Generating images from text prompts. The canonical creative AI workload — Flux, SDXL, Stable Diffusion 3.5, Playground v3 lead the open-weight tier.

Capability notes

Open-weight text-to-image in 2026 splits into three VRAM-defined tiers.

The Flux family (Black Forest Labs) leads. Flux Schnell (4-step distilled, Apache 2.0, 12 GB VRAM) outputs 1024×1024 in 1.5–3 seconds on [consumer GPUs](/hardware/rtx-4090) with best-in-class text rendering — embedded text is legible at 12pt+, a 3× improvement over SDXL. Flux Dev (50-step, non-commercial, 16 GB VRAM) adds finer prompt adherence at 8–15 seconds per generation. Flux Pro (API-only, 24 GB+ for equivalent quality) handles 2048×2048 with ControlNet guidance.

Stable Diffusion 3.5 Large (8B params, permissive license) excels at photorealistic portraits, natural lighting, and skin texture at 1024×1024 — its MMDiT architecture renders human faces with 40% fewer anatomical errors than SDXL's UNet. Weakness: text rendering is garbled on ~60% of outputs, making it unsuitable for posters or marketing assets with embedded text.

[SDXL](/tools/diffusers) (2.6B+ params) remains the most widely supported open-weight model with the largest fine-tuned ecosystem — thousands of LoRAs, ControlNets, and community fine-tunes. SDXL is the default starting point, at an 8 GB VRAM minimum. Resolution ceiling: 1024×1024 natively, 1536×1536 with high-res fix.

Quality differentiators: prompt adherence, text rendering, photorealism, anatomical correctness, style consistency. No single model leads across all. Flux dominates text rendering and prompt adherence, SD3.5 photorealism, SDXL ecosystem breadth. Match model to output requirement, not to benchmark leaderboards.
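That selection logic reduces to a few branches. A minimal sketch — the function name and thresholds are illustrative, not a real API, and the 16 GB floor for SD3.5 Large is an assumption rather than a published spec:

```python
# Toy encoding of "match model to output requirement" — names and
# thresholds are illustrative, not a real API.
def pick_model(needs_text: bool, needs_photoreal: bool, vram_gb: int) -> str:
    if needs_text and vram_gb >= 12:
        return "flux-schnell"   # best-in-class embedded text rendering
    if needs_photoreal and vram_gb >= 16:
        return "sd3.5-large"    # portraits, lighting, skin (16 GB assumed)
    if vram_gb >= 8:
        return "sdxl"           # largest LoRA/ControlNet ecosystem
    raise ValueError("below the 8 GB SDXL floor — consider SD 1.5")
```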

If you just want to try this

Lowest-friction path to a working setup.

Install [ComfyUI](/tools/comfyui) via Stability Matrix (stabilitymatrix.com → download → one-click install) on Windows, or via Pinokio (pinokio.ai → search "ComfyUI") on any OS. Both bundle ComfyUI with GPU detection — no manual Python setup. Once launched (browser tab at localhost:8188), download Flux Schnell:

  1. ComfyUI Manager → Install Models → search "flux1-schnell" → download safetensors (~23 GB).
  2. ComfyUI Manager → Install Custom Nodes → search "ComfyUI-GGUF" for FP8/NF4 quantization.
  3. Load the default Flux workflow from ComfyUI's workflow library.
  4. Set width=1024, height=1024, steps=4, guidance=3.5, type a prompt → Queue Prompt.

On [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 3–5 seconds per image. [RTX 4090](/hardware/rtx-4090): 1.5–2.5 seconds. [RTX 3060 12GB](/hardware/rtx-3060-12gb) with FP8 via GGUF: 15–25 seconds.

If Flux exceeds VRAM, fall back to SDXL: download "sd_xl_base_1.0.safetensors" via ComfyUI Manager (6.9 GB). SDXL runs on 8 GB GPUs at 8–15 seconds for 1024×1024. Quality is lower on prompt adherence and text rendering, but the LoRA ecosystem (character styles, art styles, specific subjects) is roughly 10× larger. Alternative: Stability Matrix can also install SD WebUI Forge, an A1111-style UI, without manual Python/CUDA setup — model download and GPU config handled in one application.
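To decide which path your card supports before downloading 23 GB of weights, a quick VRAM probe works. A minimal sketch using PyTorch's standard CUDA query, with the thresholds from above (12 GB floor for Flux Schnell FP8, 8 GB for SDXL):

```python
# Probe total VRAM to pick the Flux-vs-SDXL path described above.
# Requires PyTorch with CUDA support installed.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 12:
        print(f"{total_gb:.0f} GB VRAM: Flux Schnell fits (FP8 via ComfyUI-GGUF)")
    elif total_gb >= 8:
        print(f"{total_gb:.0f} GB VRAM: fall back to SDXL")
    else:
        print(f"{total_gb:.0f} GB VRAM: below the 8 GB SDXL floor")
else:
    print("No CUDA GPU detected")
```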

For production deployment

Operator-grade recommendation.

Production image generation requires GPU sizing for throughput and OOM monitoring. Throughput = (60 / seconds-per-image) × batch-size. ComfyUI + Flux Schnell (4 steps, 1024×1024, FP8):

  • [RTX 4090](/hardware/rtx-4090) (24 GB): 25–35 images/min at batch=1, 55–70 at batch=4. VRAM saturates at batch=4 (22.5 GB).
  • [RTX 5090](/hardware/rtx-5090) (32 GB): 40–55 images/min at batch=1, 80–100 at batch=4. Saturates at batch=6 (30 GB).
  • [RTX 6000 Ada](/hardware/rtx-6000-ada) (48 GB): 30–40 at batch=1, 80–110 at batch=8. Larger batches despite lower bandwidth than the 5090.
  • [L40S](/hardware/nvidia-l40s) (48 GB): identical profile — the datacenter SKU.

For Flux Dev (50 steps), expect 6–8× lower throughput. Use Dev only when Schnell's 4-step distillation produces visible artifacts on your specific prompt domain — roughly 15% of prompts, those requiring precise spatial composition.

**API vs self-host.** Flux Pro via Replicate runs ~$0.05/image at 1024×1024; SD3.5 Turbo ~$0.003/image. At 10,000 images/month, API costs $50–500/month. A self-hosted [RTX 4090](/hardware/rtx-4090) (~$250/month amortized) breaks even at the Flux Pro tier. At 100,000 images/month, a self-hosted [L40S](/hardware/nvidia-l40s) (~$400–600/month cloud rental) saves 60–80% vs API.

**Production architecture.** Run ComfyUI in API mode (`comfyui --listen --port 8188`) behind NGINX with a Redis job queue. Each API call accepts a prompt + optional reference + workflow template, queues to a GPU worker, and returns an image URL — see the client sketch below. Version-control workflows as JSON in git. Separate GPU workers by workload: 24 GB for Flux Schnell batch=4 (throughput), 48 GB for Flux Dev + ControlNet combos (quality), 8–12 GB for SDXL LoRA style-specific batches.

**OOM management.** VRAM scales quadratically with resolution. Monitor per-workflow VRAM and set hard batch-size caps. Implement a retry queue: on OOM, halve the batch size and retry. Set a max resolution per GPU tier: 8 GB → 1024×1024, 16 GB → 1536×1536, 24 GB → 2048×2048. Flux at 2048 needs 24–32 GB even at FP8 — it will OOM on 16 GB cards.
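A minimal client sketch for the API-mode pattern, using ComfyUI's built-in HTTP endpoints (`/prompt` to queue, `/history/<id>` to poll). The workflow file and the `"6"` node id are assumptions specific to whatever graph you export via Save (API Format); the NGINX and Redis layers are omitted:

```python
# Queue a prompt against a running ComfyUI instance and poll for outputs.
import json, time, urllib.request

COMFY = "http://localhost:8188"

def queue_prompt(workflow_path: str, prompt: str) -> str:
    with open(workflow_path) as f:
        wf = json.load(f)                  # exported via "Save (API Format)"
    wf["6"]["inputs"]["text"] = prompt     # node id depends on your graph
    req = urllib.request.Request(
        f"{COMFY}/prompt",
        data=json.dumps({"prompt": wf}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

def wait_for(prompt_id: str, poll_s: float = 0.5) -> dict:
    while True:
        with urllib.request.urlopen(f"{COMFY}/history/{prompt_id}") as resp:
            hist = json.load(resp)
        if prompt_id in hist:              # appears once execution finishes
            return hist[prompt_id]["outputs"]
        time.sleep(poll_s)
```

Measured seconds-per-image from this loop plugs straight into the throughput formula above when sizing workers.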

What breaks

Failure modes operators see in the wild.

**OOM at high resolutions.** Symptom: CUDA OOM when resolution exceeds GPU capacity. 2048×2048 Flux Schnell at FP16 needs 22–28 GB; add a ControlNet for +4–8 GB, a LoRA for +1–3 GB. Cause: VRAM scales quadratically with edge length (2× dimensions = 4× VRAM). Mitigation: FP8/NF4 quantization cuts VRAM 40–50% (ComfyUI-GGUF), limit max resolution per GPU in workflow config (see the sketch after this section), use tiled VAE decoding. When VRAM-tight, process batches sequentially — batching 4×1024 uses 3.5× more VRAM than sequential generation.

**CFG scale artifacts.** Symptom: CFG above 7–8 produces oversaturated colors, burnt highlights, unnatural contrast. Cause: high CFG pushes sampling too far from the unconditional path — "CFG burn." Mitigation: keep CFG at 3.5–5.0 for the Flux family, cap it at 7 for SDXL, and use dynamic thresholding (the CFG Rescale node in ComfyUI).

**Face and limb distortion.** Symptom: six-plus fingers, merged limbs, asymmetrical eyes. SD3.5 reduces this to ~15% of outputs (vs SDXL's ~25%). Cause: diffusion models denoise patches independently — there is no architectural guarantee of anatomy. Mitigation: use SD3.5 or Flux for humans (DiT architectures handle global coherence better), apply face-detail LoRAs, use ADetailer (automatic face inpainting), or batch-generate 5–10 variants and filter with an aesthetic-scoring model.

**Text gibberish in images.** Symptom: embedded text reads as random characters. SDXL: ~10% legible; SD3.5: ~40%; Flux: ~85%. Cause: patch-based denoising doesn't produce character-level stroke coherence; Flux's architecture preserves character-level representations better. Mitigation: use Flux for text-containing images, keep text short (1–5 words), and composite generated images with real text in post-processing for commercial use.

**NSFW filter false positives.** Symptom: innocuous prompts containing "girl," "body," or anatomical terms get blocked. SD3.5's safety classifier blocks ~3–5% of benign prompts. Cause: keyword matching without semantic understanding. Mitigation: rephrase to avoid flagged keywords (the community maintains lists), use SDXL community fine-tunes that lack the safety classifier (verify the license), or deploy custom workflows that bypass the filter node.

**Prompt bleed in batch generation.** Symptom: previous-prompt elements appear in subsequent images — a jacket color bleeds, a background style persists. Cause: diffusion samplers can carry residual noise patterns between consecutive generations. Mitigation: randomize the seed per generation, clear the GPU cache between batch runs, and toggle a "Clear Cache" node between batches in persistent ComfyUI workflows.
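The OOM mitigations above (per-tier resolution caps, halve-and-retry) fit in a few lines. A sketch, where `generate` stands in for your actual pipeline call (not a real API) and the caps come from the production section:

```python
# Halve-and-retry OOM policy with per-VRAM-tier resolution caps.
import torch

RES_CAP = {8: 1024, 16: 1536, 24: 2048}     # VRAM GB -> max edge length

def generate_with_retry(generate, prompt: str, batch: int, res: int):
    vram_gb = torch.cuda.get_device_properties(0).total_memory // 1024**3
    fitting = [edge for gb, edge in RES_CAP.items() if gb <= vram_gb]
    if not fitting or res > max(fitting):
        raise ValueError(f"{res}px exceeds the cap for {vram_gb} GB VRAM")
    while batch >= 1:
        try:
            return generate(prompt, batch=batch, res=res)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()         # release cached blocks first
            batch //= 2                      # halve and retry, per the text
    raise RuntimeError("OOM even at batch=1 — lower resolution or quantize")
```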

Hardware guidance

**Hobbyist tier ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): SDXL at 8–15 seconds — functional but slow. Flux Schnell FP8 via GGUF: 15–25 seconds — VRAM at the floor. [Intel Arc B580](/hardware/intel-arc-b580) (12 GB): SDXL in 20–30 seconds via IPEX. [RX 7600 XT](/hardware/rx-7600-xt) (16 GB): SDXL in 15–25 seconds via DirectML/ROCm — 30–50% slower than NVIDIA at the same tier.

**Enthusiast tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090) (24 GB): the image-generation king — Flux Schnell in 1.5–2.5s, Flux Dev in 8–12s, SDXL in 2–4s. Fits Flux Dev + 1 ControlNet + 2 LoRAs. [RTX 5090](/hardware/rtx-5090) (32 GB): Flux Dev + 2 ControlNets + 3 LoRAs + 2048 upscale — the best single-card consumer generation machine. [RTX 5080](/hardware/rtx-5080) (16 GB): Flux Schnell in 3–5s, but cannot fit Flux Dev + ControlNet combos — 16 GB is restrictive for professional workflows. [RX 7900 XTX](/hardware/rx-7900-xtx) (24 GB): SDXL in 5–8s via ROCm on Linux — the AMD stack still has ComfyUI/Flux rough edges (node compatibility, FP8 gaps).

**Professional tier ($6,000–15,000).** [RTX 6000 Ada](/hardware/rtx-6000-ada) (48 GB): fits Flux Dev + 3 ControlNets + 5 LoRAs + 2048 output simultaneously; batch=8 at 1024 yields 100+ images/min. [L40S](/hardware/nvidia-l40s) (48 GB): the datacenter equivalent, with better sustained thermals.

**Enterprise tier ($25,000+).** [A100 80GB SXM](/hardware/nvidia-a100-80gb-sxm): all workflows with headroom — Flux Pro-class quality at 4096, multi-ControlNet, batch=16. The [H100](/hardware/nvidia-h100-pcie) is poor value here — its bandwidth and interconnect advantages pay off in LLM serving, while diffusion inference is compute-bound and the premium buys little extra throughput per dollar. The enterprise pick maximizes VRAM/$: A100 80 GB for 2048+ resolution, L40S 48 GB for 1024–1536 throughput.

VRAM is the binding constraint on what you can run — if a workflow exceeds VRAM, it crashes. Once it fits, generation time scales roughly linearly with step count divided by compute throughput. For throughput: prioritize VRAM → batch size → images/min. For low latency: prioritize compute → per-image speed.

Runtime guidance

**ComfyUI vs Automatic1111 vs Diffusers — workflow paradigm comparison.** [ComfyUI](/tools/comfyui) uses a node-based graph editor: each operation (load model, encode prompt, sample, decode, upscale, save) is a node connected by edges. This is the production standard in 2026. Advantages: workflows are JSON files — version-controllable, shareable, reproducible; the node architecture enables complex pipelines (multi-model, multi-ControlNet, upscale chains) impossible in linear UIs; API mode serves as a production backend — POST workflow JSON + prompt, get an image. Tradeoffs: a learning curve for the node graph, debugging invisible edges, and a community fragmented across 500+ custom node packages.

[Automatic1111 WebUI](/tools/automatic1111) uses a linear tab interface: txt2img → img2img → extras. It was the default from 2022–2024 but is largely superseded. Advantages: simpler learning curve, one-click installers, an enormous tutorial base. Tradeoffs: linear workflows cannot express multi-model pipelines, the modal UI can't show a full pipeline at once, and development has slowed markedly since 2024. Acceptable for casual use; not production-grade.

**SD WebUI Forge** (an A1111 fork by lllyasviel, the ControlNet author) optimizes memory management for 30–50% lower VRAM use: it handles UNet offloading, VAE tiling, and gradient checkpointing more aggressively. A drop-in upgrade for A1111 users with VRAM constraints — identical UI, same extensions, lower VRAM. The right choice for 8–12 GB GPUs running SDXL.

[Diffusers](/tools/diffusers) (Hugging Face) is the Python API for programmatic generation — no GUI, pure code. Use it for API-level control: dynamic prompt scheduling, custom sampling, model merging at inference, integration into larger Python apps. It loads any Hugging Face model (Flux, SDXL, SD3.5) with the same API. Tradeoffs: you write Python for every pipeline, there is no visual workflow debugging, and GPU memory management is manual.

**Decision tree.** Primary: [ComfyUI](/tools/comfyui) — the standard, with community momentum, scaling from hobbyist to production API. [Automatic1111](/tools/automatic1111) or Forge: only for existing A1111 workflows with simple pipelines (txt2img, img2img, no ControlNets). [Diffusers](/tools/diffusers): custom applications wrapping image generation in business logic. Production serving: ComfyUI API mode + a Python client — the workflow JSON defines the pipeline, the application handles queuing, auth, and delivery.

**Model loading.** ComfyUI's "Load Checkpoint" node loads the full checkpoint into VRAM when the workflow first executes.
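What the Diffusers path looks like in practice — a minimal sketch generating one Flux Schnell image. The pipeline class and model id are the standard Hugging Face ones; the settings mirror the Schnell numbers used throughout (4 steps, 1024×1024):

```python
# Minimal Diffusers sketch for Flux Schnell — no GUI, pure code.
# ~23 GB of weights download on first run; enable_model_cpu_offload()
# trades speed for fitting in less VRAM.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()        # or pipe.to("cuda") with 24 GB+

image = pipe(
    "A photorealistic cat wearing a wizard hat, reading a spellbook",
    width=1024, height=1024,
    num_inference_steps=4,             # Schnell is 4-step distilled
    guidance_scale=0.0,                # Schnell ignores CFG; keep at 0
    generator=torch.Generator("cpu").manual_seed(42),  # reproducible output
).images[0]
image.save("wizard_cat.png")
```

The same structure serves SDXL or SD3.5 — `DiffusionPipeline.from_pretrained` auto-selects the right pipeline class per model, which is the "same API across models" point above.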

Setup walkthrough

  1. Install ComfyUI via Stability Matrix (stabilitymatrix.com → download → one-click install).
  2. Open http://localhost:8188 in browser.
  3. ComfyUI Manager → Install Models → search "flux1-schnell" → download (~23 GB).
  4. Load the default Flux workflow from the workflow library (or drag-and-drop from comfyanonymous.github.io).
  5. Set width=1024, height=1024, steps=4 (Schnell is a distilled model — few steps needed).
  6. Type prompt: "A photorealistic cat wearing a wizard hat, reading a spellbook in a cozy library." Click Queue.
  7. First image in 2-5 seconds on RTX 3090/4090 24 GB.

Alternative (lighter): ComfyUI Manager → install SDXL 1.0 (~6.9 GB). Steps=30, CFG=7. First image in 8–15 seconds on RTX 3060 12 GB.

The cheap setup

Used RTX 3060 12 GB ($200-250, see /hardware/rtx-3060-12gb). Runs SDXL at 1024×1024 in 8-15 seconds per image. Flux Schnell runs but requires FP8 quant (12 GB via ComfyUI-GGUF custom node) at 15-25 seconds per 1024×1024. Pair with Ryzen 5 5600 ($90 used) + 32 GB DDR4 ($50) + 1TB NVMe ($50). Total: ~$390-440. For lighter weight: RTX 4060 8 GB ($280 new) runs SD 1.5 at 2-3s and SDXL at 12-20s.

The serious setup

Used RTX 3090 24 GB ($700-900, see /hardware/rtx-3090). Runs Flux Schnell at 2-4s per 1024×1024, Flux Dev at 10-15s (steps=20). Can train LoRAs on Flux (20 min per LoRA on 20 images). SDXL at 3-5s per image. Pair with Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Total: $1,800-2,200. RTX 4090 24 GB ($1,600 used, see /hardware/rtx-4090) drops Flux Dev to 5-8s — ~40% faster than 3090.

Common beginner mistake

**The mistake.** Downloading the FP16 Flux Dev checkpoint on a 12 GB card and wondering why it runs out of memory.

**Why it fails.** FP16 Flux Dev is 23 GB — it doesn't fit in 12 GB VRAM. ComfyUI silently falls back to system RAM, making generation take 5+ minutes instead of seconds.

**The fix.** Download the FP8 quantized version via ComfyUI-GGUF or the official Flux FP8 checkpoint (12 GB). On 12 GB cards this fits entirely in VRAM and runs at 15–25 seconds per 1024×1024 image. Always check what's loaded in VRAM in ComfyUI's system stats panel.
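A cheap preflight for this mistake: compare checkpoint file size (a reasonable proxy for weight VRAM) against free VRAM before loading. A minimal sketch — the 20% activation headroom is a rough assumption, not a measured figure:

```python
# Will this checkpoint fit? File size vs free VRAM, with headroom.
import os
import torch

def fits_in_vram(checkpoint_path: str, headroom: float = 1.2) -> bool:
    size_gb = os.path.getsize(checkpoint_path) / 1024**3
    free_gb = torch.cuda.mem_get_info()[0] / 1024**3   # (free, total) bytes
    return size_gb * headroom <= free_gb

# FP16 Flux Dev (~23 GB) fails this on a 12 GB card; the FP8 build (~12 GB)
# only barely passes — expect partial offload if anything else uses the GPU.
```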

Reality check

Image gen is compute-bound, not bandwidth-bound. VRAM sets which resolutions and LoRA training runs you can fit, but FP16 TFLOPS is what decides Flux throughput. The 5080's compute advantage over the 5070 Ti shows here in ways it doesn't on LLM inference.

Common mistakes

  • Buying for the VRAM ceiling without checking compute (Flux Dev FP16 doesn't fit in 16 GB anyway)
  • Skipping LoRA training requirements (24 GB minimum, 32 GB comfortable for Flux)
  • Underestimating ComfyUI's multi-model VRAM appetite vs A1111's single-pipeline
  • Using Q4 quantized image models — quality drop is more visible than on LLMs

