Coarse-to-Fine Text-to-Music Latent Diffusion
Title | Coarse-to-Fine Text-to-Music Latent Diffusion |
Publication Type | Conference Paper |
Year of Publication | 2024 |
Authors | Lanzendörfer L.A., Lu T., Perraudin N., Herremans D., Wattenhofer R. |
Conference Name | Audio Imagination: NeurIPS 2024 Workshop |
Conference Location | Vancouver |
Abstract | We introduce DiscoDiff, a text-to-music generative model that uses two latent diffusion models to hierarchically generate high-fidelity 44.1 kHz music. Our coarse-to-fine generation strategy significantly improves audio quality by leveraging the residual vector quantization of the Descript Audio Codec. This design rests on a key observation: the audio latent representation can be split into a primary and a secondary part, which control the musical content and the fine details, respectively. We validate the effectiveness of our approach and its text-audio alignment through various objective metrics. Furthermore, we provide high-quality synthetic captions for the MTG-Jamendo and FMA datasets, and we open-source DiscoDiff's codebase and model checkpoints. |