Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

TLDR

This work develops a cascading latent diffusion approach that generates multiple minutes of high-quality stereo music at 48 kHz from textual descriptions, targeting real-time inference on a single consumer GPU.
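
To give a rough sense of scale for the "long-context" claim, the sketch below works out how many raw audio samples a clip at the stated format contains, and what "real-time" means as a ratio. The sample rate and channel count come from the summary above; the function names and the example clip lengths and timings are illustrative assumptions, not figures from the paper.

```python
SAMPLE_RATE_HZ = 48_000  # 48 kHz, as stated in the summary
CHANNELS = 2             # stereo


def raw_sample_count(minutes: float) -> int:
    """Total scalar samples in `minutes` of stereo audio at 48 kHz."""
    return int(minutes * 60 * SAMPLE_RATE_HZ * CHANNELS)


def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Ratio of audio duration to generation time.

    A value of 1.0 or more means the audio is produced at least as fast
    as it plays back.
    """
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    # Hypothetical clip lengths, chosen only for illustration.
    for minutes in (1, 4):
        print(f"{minutes} min -> {raw_sample_count(minutes):,} samples")

    # Hypothetical timing: 60 s of audio generated in 30 s of compute.
    print(f"real-time factor: {realtime_factor(60, 30):.1f}")
```

One minute already amounts to several million values, which suggests why generation is done in a compressed latent space instead of directly on the waveform.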