Coarse-to-Fine Text-to-Music Latent Diffusion

Publication Type: Conference Paper
Year of Publication: 2024
Authors: Lanzendörfer L.A., Lu T., Perraudin N., Herremans D., Wattenhofer R.
Conference Name: Audio Imagination: NeurIPS 2024 Workshop
Conference Location: Vancouver
Abstract

We introduce DiscoDiff, a text-to-music generative model that uses two latent diffusion models to generate high-fidelity 44.1 kHz music hierarchically. Our approach significantly improves audio quality through a coarse-to-fine generation strategy that leverages the residual vector quantization of the Descript Audio Codec. This coarse-to-fine design is grounded in a key observation: the audio latent representation can be split into a primary and a secondary part, which control the musical content and the fine details, respectively. We validate the effectiveness of our approach and its text-audio alignment through various objective metrics. Furthermore, we provide high-quality synthetic captions for the MTG-Jamendo and FMA datasets, and we open-source DiscoDiff's codebase and model checkpoints.
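
As an illustration of the primary/secondary split mentioned in the abstract, below is a minimal sketch using the public descript-audio-codec package (with its audiotools dependency). The split point N_PRIMARY and the input file are illustrative assumptions, not DiscoDiff's actual configuration.

    # Sketch: splitting the Descript Audio Codec's residual codebooks into a
    # primary (coarse content) and secondary (fine detail) part.
    # N_PRIMARY and "input.wav" are assumptions for illustration only.
    import torch
    import dac
    from audiotools import AudioSignal

    N_PRIMARY = 3  # assumed: the first RVQ codebooks carry the coarse content

    # Load the pretrained 44.1 kHz codec, matching the sample rate in the paper.
    model_path = dac.utils.download(model_type="44khz")
    model = dac.DAC.load(model_path).eval()

    # DAC operates on mono 44.1 kHz audio.
    signal = AudioSignal("input.wav").to_mono().resample(44100)
    x = model.preprocess(signal.audio_data, signal.sample_rate)

    with torch.no_grad():
        # codes: [B, n_codebooks, T] discrete indices from residual vector quantization
        z, codes, latents, _, _ = model.encode(x)

        primary = codes[:, :N_PRIMARY, :]    # coarse stage: overall musical content
        secondary = codes[:, N_PRIMARY:, :]  # fine stage: residual detail

        # Decoding only the primary codebooks illustrates the coarse level of the
        # hierarchy; a fine-stage model would then add the secondary codes back.
        z_coarse, _, _ = model.quantizer.from_codes(primary)
        audio_coarse = model.decode(z_coarse)

Because the quantization is residual, dropping the later codebooks degrades only fine detail, which is what makes this split a natural target for a two-stage, coarse-to-fine generation pipeline.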