Music Generation

Generating music from text prompts or melody references with open-weight models such as MusicGen and Stable Audio Open, plus open-weight alternatives to Suno.

Setup walkthrough

  1. Install AudioCraft, Meta's library that includes MusicGen, a leading open-weight music generation model: pip install audiocraft
  2. On first run, MusicGen.get_pretrained(...) downloads the model weights (~1.5 GB for small, ~3.5 GB for medium, ~7 GB for large) and caches them. The import statement itself downloads nothing.
  3. Python script:
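from audiocraft.models import MusicGen
import soundfile as sf

# Downloads the checkpoint on the first call, then loads it from the local cache.
model = MusicGen.get_pretrained("facebook/musicgen-medium")
model.set_generation_params(duration=30)  # 30-second track

# generate() takes a list of prompts and returns a tensor shaped [batch, channels, samples].
wav = model.generate(["Upbeat electronic dance music with a driving bassline, 120 BPM, synth melody, energetic drop"])

# soundfile expects [samples, channels], hence the transpose.
sf.write("output.wav", wav[0].cpu().numpy().T, model.sample_rate)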
  4. Expect the first 30-second track in 10-30 seconds on a GPU, or 1-3 minutes on CPU.
  5. For melody-conditioned generation: load the facebook/musicgen-melody checkpoint and pass a reference melody (humming, whistling, piano) to generate_with_chroma. MusicGen follows the melody while applying the text-prompted style. The small, medium, and large checkpoints accept text prompts only.
  6. For longer compositions: generate in 30-second segments and crossfade. MusicGen Medium produces coherent music with clear genre fidelity.
  7. Alternative: Stable Audio Open (~1 GB, pip install stable-audio-tools) is better for ambient/soundscape work and worse for structured music.
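The "generate in 30-second segments and crossfade" step can be sketched as follows. This is a minimal illustration, not AudioCraft's own method: the helper name crossfade and the linear fade curve are assumptions, and it works on plain sequences of mono samples so that it runs without extra dependencies.

```python
def crossfade(a, b, overlap):
    """Join two mono clips with a linear crossfade.

    The last `overlap` samples of `a` fade out while the first
    `overlap` samples of `b` fade in. The result has
    len(a) + len(b) - overlap samples.
    """
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter clip length")
    head = list(a[:-overlap])   # part of `a` before the blend
    tail = list(b[overlap:])    # part of `b` after the blend
    blended = []
    for i in range(overlap):
        # t rises from just above 0 to just below 1 across the overlap.
        t = (i + 0.5) / overlap
        blended.append(a[len(a) - overlap + i] * (1.0 - t) + b[i] * t)
    return head + blended + tail


# Tiny demonstration: a clip of ones fading into a clip of zeros.
demo = crossfade([1.0] * 6, [0.0] * 6, overlap=4)
print(len(demo), demo)
```

With real output, a and b would be the per-channel sample arrays of two generated segments, and overlap would be a duration multiplied by model.sample_rate (MusicGen produces 32 kHz audio, so a 2-second overlap is 64,000 samples). A linear curve can dip in loudness when the two segments are unrelated; an equal-power curve is a common alternative in that case.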

The cheap setup

MusicGen Small (1.5 GB) runs on CPU roughly 2-4× slower than real time: a 30-second track takes 1-2 minutes. Any $300 laptop handles this. MusicGen Medium (3.5 GB) on a used GTX 1060 6 GB ($60) generates 30 seconds in 10-15 seconds, which is 2-3× faster than real time. For a full music generation rig: GTX 1060 6 GB ($60) + refurbished PC ($150) + 16 GB RAM ($30). Total: ~$240. Music generation is one of the most accessible creative AI tasks, and even CPU-only laptops produce usable results. The bottleneck is your musical taste and prompt engineering, not hardware.
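To compare the timings quoted here on one scale, it helps to express them as a real-time factor: seconds of audio produced per second of compute. The sketch below uses only the figures from this section; the function name real_time_factor is an illustrative choice.

```python
def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute.

    Above 1.0 means faster than real time, below 1.0 means slower.
    """
    if audio_seconds <= 0 or wall_seconds <= 0:
        raise ValueError("durations must be positive")
    return audio_seconds / wall_seconds


# Wall-clock times quoted in the text for one 30-second track.
quoted = {
    "Small on CPU, 1 minute": 60,
    "Small on CPU, 2 minutes": 120,
    "Medium on GTX 1060, 10 s": 10,
    "Medium on GTX 1060, 15 s": 15,
}
for label, wall in quoted.items():
    rtf = real_time_factor(30, wall)
    speed = "faster" if rtf > 1 else "slower"
    print(f"{label}: {rtf:.2f}x real time ({speed} than playback)")
```

On these numbers the CPU path lands at 0.25-0.5× real time and the GTX 1060 at 2-3× real time, so the GPU clears playback speed with room to spare.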

The serious setup

A used RTX 3060 12 GB ($200-250, see /hardware/rtx-3060-12gb) is more than sufficient for production music generation. MusicGen Large (7 GB) generates 30 seconds in 5-10 seconds, which is 3-6× faster than real time. That is enough to batch-generate 100 tracks/hour for music library building. For melody-conditioned generation (hum a tune → full production), an RTX 3060 runs in near real time. Total build: ~$700-900. Music generation is VRAM-light and fast, so even entry-level GPUs handle the largest open-weight models. Spend your budget on studio monitors and a MIDI keyboard, not GPU.
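The batch figure can be sanity-checked with simple arithmetic. The sketch below is a planning aid under stated assumptions: the function names are illustrative, and overhead_seconds is a hypothetical allowance for saving files, loading prompts, and similar per-track costs that the text does not quantify.

```python
def tracks_per_hour(seconds_per_track, overhead_seconds=0.0):
    """Tracks finished in one hour at a fixed per-track cost."""
    return 3600 / (seconds_per_track + overhead_seconds)


def hours_for_library(track_count, seconds_per_track, overhead_seconds=0.0):
    """Wall-clock hours needed to generate `track_count` tracks."""
    return track_count * (seconds_per_track + overhead_seconds) / 3600


# Generation time quoted for MusicGen Large on an RTX 3060: 5-10 s per track.
print(tracks_per_hour(5))        # upper bound with no overhead
print(tracks_per_hour(10))       # lower bound with no overhead
# Hypothetical 26 s of overhead per track on top of 10 s of generation.
print(tracks_per_hour(10, overhead_seconds=26))
print(round(hours_for_library(1000, 10), 2))
```

At 5-10 seconds per track the raw ceiling is 360-720 tracks per hour, so the 100 tracks/hour figure corresponds to a budget of 36 seconds per track and leaves generous slack for everything around the generation call.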

Common beginner mistake

The mistake: Generating a 30-second track with MusicGen, thinking "this sounds great," generating 10 more with different prompts, then trying to arrange them into a song, only to discover that every track has a wildly different key, tempo, and mix.

Why it fails: MusicGen generates independent clips. Clip 1 might be in C minor at 120 BPM with heavy reverb. Clip 2 is in E major at 140 BPM with dry production. Nothing matches.

The fix: Treat MusicGen as an idea generator, not a song producer. MusicGen outputs a single mixed waveform rather than separate stems, so generate individual elements (bassline in Cm at 120 BPM, drum loop, synth pad) as separate clips, with consistent prompts specifying key and tempo. Prompts steer key and tempo but do not guarantee them, so check each clip. Import the clips into a DAW (Ableton, FL Studio, Reaper), then arrange, mix, and master there. AI generates raw material; you produce the track. Professional AI-assisted music production is AI + DAW, never AI alone.
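One way to keep prompts consistent is to build them from shared settings instead of typing each one by hand. The helper below is a hypothetical sketch (element_prompts and its wording are not part of AudioCraft); it only assembles strings, and the model may still drift from the requested key or tempo.

```python
def element_prompts(key, bpm, style, elements):
    """Build one prompt per element, repeating the shared style, key, and tempo."""
    shared = f"{style}, {key}, {bpm} BPM"
    return {name: f"{description}, {shared}" for name, description in elements.items()}


prompts = element_prompts(
    key="C minor",
    bpm=120,
    style="electronic dance music",
    elements={
        "bass": "solo driving bassline",
        "drums": "drum loop only",
        "pad": "sustained synth pad",
    },
)
for name, prompt in prompts.items():
    print(f"{name}: {prompt}")
```

The resulting strings can be passed to the model in one batch, for example model.generate(list(prompts.values())), and each returned clip saved under its element name for import into the DAW.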

Recommended setup for music generation

Reality check

Audio models are surprisingly forgiving on hardware. Whisper, whisper.cpp, and Coqui TTS all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.

Common mistakes

  • Overspending on GPU for audio-only workflows (8-12 GB is enough for Whisper)
  • Running audio + LLM concurrently without budgeting VRAM
  • Using fp32 weights when fp16 / int8 can give a 2-3x speedup with little or no audible quality loss
  • Forgetting that audio preprocessing eats CPU cycles and disk bandwidth; a fast SSD helps more than expected
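The VRAM budgeting and precision points in this list can be made concrete with a rough weights-only estimate. This is a sketch under loud assumptions: parameter counts of 1.5B and 3.3B for MusicGen Medium and Large are taken from the model names' published sizes, the 7B LLM is a hypothetical companion model, the 10% headroom is an arbitrary safety margin, and activations, the text encoder, and the audio codec are all ignored.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_gb(params, precision):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def fits(vram_gb, models, headroom=0.10):
    """Check whether all models' weights fit while leaving `headroom` free.

    `models` is a list of (parameter_count, precision) pairs.
    Returns (fits, total_weight_gb).
    """
    needed = sum(weight_gb(params, precision) for params, precision in models)
    return needed <= vram_gb * (1 - headroom), needed


MUSICGEN_MEDIUM = 1.5e9   # assumed parameter count
MUSICGEN_LARGE = 3.3e9    # assumed parameter count
LLM_7B = 7e9              # hypothetical LLM sharing the same GPU

print(fits(12, [(MUSICGEN_MEDIUM, "fp16"), (LLM_7B, "int8")]))
print(fits(12, [(MUSICGEN_LARGE, "fp16"), (LLM_7B, "int8")]))
print(weight_gb(MUSICGEN_LARGE, "fp32"), weight_gb(MUSICGEN_LARGE, "fp16"))
```

Under these assumptions a 12 GB card holds MusicGen Medium in fp16 next to an int8 7B LLM, but not MusicGen Large, and dropping Large from fp32 to fp16 halves its weight footprint. Treat the output as a first-pass filter and confirm with real memory measurements.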

Before you buy

Verify your specific hardware can handle music generation before committing money.

Hardware buying guidance for Music Generation

Voice cloning, TTS, and audio generation models trade VRAM for output quality — most operators undersize here.
