Learning accent representation with multi-level VAE towards controllable speech synthesis
|Title||Learning accent representation with multi-level VAE towards controllable speech synthesis|
|Publication Type||Conference Paper|
|Year of Publication||2023|
|Authors||Melechovsky J., Mehrish A., Herremans D., Sisman B.|
|Conference Name||IEEE Spoken Language Technology (SLT) Workshop|
|Conference Location||Doha, Qatar|
Accent is a crucial aspect of speech that helps define one's identity. State-of-the-art Text-to-Speech (TTS) systems can generate high-quality voices, but they still lack versatility and customizability, and they generally do not take accent, an important feature of speaking style, into account. In this work, we utilize the concept of the Multi-level VAE (ML-VAE) to build a control mechanism that disentangles accent from a reference accented speaker and synthesizes voices in different accents, such as English, American, Irish, and Scottish. Notably, the proposed framework also achieves high-quality accented voice generation in a multi-speaker setup. We evaluate performance with objective metrics and conduct listening experiments for a subjective assessment, showing that the proposed method achieves good performance in terms of naturalness, speaker similarity, and accent similarity.
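The core idea behind the ML-VAE used here is that utterances sharing an accent label share a single group-level latent, while a separate latent varies per utterance. The sketch below illustrates that grouping step with simple averaging over per-sample encodings; the function name, the toy data, and the averaging rule are illustrative assumptions (the actual ML-VAE accumulates Gaussian evidence across a group), not the paper's implementation.

```python
# Illustrative sketch of ML-VAE-style grouping for accent:
# all utterances with the same accent label are pooled into one
# shared accent latent; each utterance keeps its own per-sample latent.
# Averaging stands in for ML-VAE's Gaussian evidence accumulation.
from collections import defaultdict

def group_latents(sample_encodings, accent_labels):
    """Pool per-utterance encodings into one latent per accent group."""
    sums = {}
    counts = defaultdict(int)
    for enc, accent in zip(sample_encodings, accent_labels):
        if accent not in sums:
            sums[accent] = list(enc)
        else:
            sums[accent] = [s + e for s, e in zip(sums[accent], enc)]
        counts[accent] += 1
    # One shared latent per accent group (the mean of its members).
    return {a: [s / counts[a] for s in v] for a, v in sums.items()}

# Toy 2-D encodings: two "Irish" utterances and one "Scottish" one.
encs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = ["Irish", "Irish", "Scottish"]
accent_latents = group_latents(encs, labels)
# accent_latents["Irish"] -> [2.0, 3.0]; at synthesis time, every Irish
# utterance would be decoded with this shared accent latent plus its
# own per-sample latent, enabling accent control across speakers.
```

Swapping the group latent at decoding time is what would let the same speaker be rendered in a different accent.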