DDDM-VC audio demo page
Overall framework of DDDM-VC
Sampling rates: DDDM: 16,000 Hz / YourTTS: 16,000 Hz / NANSY: 22,050 Hz / NANSY++: 44,100 Hz
We conducted a zero-shot VC test using the zero-shot samples from the official NANSY and NANSY++ demo pages.
NANSY and NANSY++ were trained on data with higher sampling rates,
so for a fair comparison we downsampled their samples to 16 kHz.
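The downsampling step can be illustrated with a minimal sketch. This page does not say which resampler was used, so the Hann-windowed sinc interpolation below is an assumption, and the function name `resample` and the parameter `zero_crossings` are hypothetical. It is written in pure Python for readability; in practice a library routine such as `scipy.signal.resample_poly` would be the more likely choice.

```python
import math


def resample(samples, sr_in, sr_out, zero_crossings=16):
    """Resample a mono signal with a Hann-windowed sinc kernel.

    Illustrative only: the resampler used for the demo samples is not stated.
    sr_in and sr_out are integer sampling rates in Hz.
    """
    if sr_in <= 0 or sr_out <= 0:
        raise ValueError("sampling rates must be positive")
    # When downsampling, lower the cutoff to the new Nyquist frequency
    # so that content above it is attenuated instead of aliasing.
    cutoff = min(1.0, sr_out / sr_in)
    # Kernel half-width in input samples; it widens as the cutoff drops.
    half_width = math.ceil(zero_crossings / cutoff)
    # Integer arithmetic avoids off-by-one errors from float rounding.
    n_out = len(samples) * sr_out // sr_in
    out = []
    for n in range(n_out):
        t = n * sr_in / sr_out  # position of output sample n on the input axis
        center = math.floor(t)
        acc = 0.0
        for k in range(center - half_width + 1, center + half_width + 1):
            if 0 <= k < len(samples):
                d = t - k
                x = cutoff * d
                sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
                window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
                acc += samples[k] * cutoff * sinc * window
        out.append(acc)
    return out
```

Under these assumptions, a NANSY sample would be converted with `resample(x, 22050, 16000)` and a NANSY++ sample with `resample(x, 44100, 16000)`.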
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
male |
male |
DDDM-VC |
YourTTS |
NANSY |
male |
female |
DDDM-VC |
YourTTS |
NANSY |
female |
female |
DDDM-VC |
YourTTS |
NANSY |
female |
male |
DDDM-VC |
YourTTS |
NANSY |
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
female |
female |
DDDM-VC |
YourTTS |
NANSY++ |
male |
male |
DDDM-VC |
YourTTS |
NANSY++ |
Real-world Dataset
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
DDDM-VC-Fine-tuning |
DDDM-VC |
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
226_007 |
DDDM-VC-Fine-tuning |
DDDM-VC |
||
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
|||
DDDM-VC-Fine-tuning |
DDDM-VC |
All speakers are seen during training
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT (1571) |
GT (3526) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC-6 |
DiffVC-30 | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC | |||
GT (3699) |
GT (374) |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC-6 |
DiffVC-30 | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC |
All speakers are unseen during training
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT |
GT |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC |
DiffVC | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC | |||
GT |
GT |
AutoVC |
VoiceMixer |
Speech Resynthesis |
DiffVC |
DiffVC | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC | |||
DDDM-VC |
DDDM-VC |
Unseen languages and unseen speakers from the CSS10 multilingual dataset
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
French |
Hungarian |
DDDM-VC |
|
French |
Greek |
DDDM-VC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Finnish |
Dutch |
DDDM-VC |
|
Finnish |
Russian |
DDDM-VC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Russian |
Dutch |
DDDM-VC |
|
Russian |
French |
DDDM-VC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
Spanish |
French |
DDDM-VC |
|
Spanish |
Russian |
DDDM-VC |
Source Speaker | Target Speaker | Converted | |
---|---|---|---|
German |
French |
DDDM-VC |
|
German |
Dutch |
DDDM-VC |
Multi-speaker TTS
Sentence 1 According to the count's directions, Danglars was waited on by Vampa, who brought him the best wine and fruits of Italy; then, having conducted him to the road, and pointed to the post chaise, left him leaning against a tree. |
|||
---|---|---|---|
GT |
VITS |
HierSpeech |
DDDM-TTS (Ours) |
Sentence 2 Deprived of the objects of both intellect and emotion, he could not proceed to his work. | |||
---|---|---|---|
GT |
VITS |
HierSpeech |
DDDM-TTS (Ours) |
Sentence 3 I doubt not but in acknowledgment you will make your deliverer your wife, as I have promised." He joyfully consented; but before they married, she changed my wife into a hind; and this is she whom you see here. |
|||
---|---|---|---|
GT |
VITS |
HierSpeech |
DDDM-TTS (Ours) |
Sentence 4 Suddenly Graham's knees bent beneath him, his arm against the pillar collapsed limply, he staggered forward and fell upon his face. |
|||
---|---|---|---|
GT |
VITS |
HierSpeech |
DDDM-TTS (Ours) |
Sentence 5 In modern times the sultans or rulers of Turkey have been commonly regarded as the caliphs. |
|||
---|---|---|---|
GT |
VITS |
HierSpeech |
DDDM-TTS (Ours) |
Sentence 6 (Zero-shot) When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. |
|||
---|---|---|---|
GT |
DDDM-TTS (Ours) |
Sentence 7 (Zero-shot) These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. |
|||
---|---|---|---|
GT |
DDDM-TTS (Ours) |
Audio Mixing
Synthesized sound and synthesized speech, for which no original recording exists, are mixed at the desired ratio.
Sound audio was generated using AudioLDM [H. Liu et al., 2023], and speech audio was generated using our DDDM-VC.
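As a rough illustration of mixing at a desired ratio, the sketch below scales the sound track so that the speech-to-sound RMS ratio matches a target value in decibels. The page does not define how the ratio is measured, so the RMS-based definition, the function names `rms` and `mix_at_ratio`, and the peak normalization are all assumptions.

```python
import math


def rms(samples):
    """Root-mean-square level of a signal (0.0 for an empty signal)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_at_ratio(speech, sound, ratio_db):
    """Mix speech and sound so that speech RMS / sound RMS equals ratio_db.

    Assumption: the "ratio" is an RMS level ratio in dB. Both inputs are
    mono float sequences at the same sampling rate; the longer one is
    truncated to the length of the shorter one.
    """
    n = min(len(speech), len(sound))
    speech, sound = speech[:n], sound[:n]
    sound_rms = rms(sound)
    if sound_rms == 0.0:
        # A silent sound track cannot be scaled to any level.
        return list(speech)
    # Gain applied to the sound so the level ratio hits the target.
    gain = rms(speech) / (sound_rms * 10 ** (ratio_db / 20))
    mixed = [s + gain * b for s, b in zip(speech, sound)]
    # Scale down only if the sum would clip outside [-1, 1].
    peak = max((abs(v) for v in mixed), default=0.0)
    if peak > 1.0:
        mixed = [v / peak for v in mixed]
    return mixed
```

For example, `mix_at_ratio(speech, sound, 10.0)` would keep the converted voice about 10 dB above the generated sound, while `0.0` mixes both at equal RMS level.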
Synthesized Sound Audio | Synthesized Speech Audio | Mixed Audio (Ours) | |
---|---|---|---|
Techno music with a strong, upbeat tempo and high melodic riffs |
Converted voice |
||
A capella |
Converted voice |
||
Chopping potatoes on a metal table. |
Converted voice |
||
The sound of a steam engine. |
Converted voice |
Many-to-many VC tasks with seen speakers from the LibriTTS dataset
Source Speaker | Target Speaker | Converted | ||
---|---|---|---|---|
GT |
GT |
|||
w/o Prior Mixup |
w/o Disentangled Denoiser | |||
w/o Normalized F0 |
w/o Data-driven Prior | |||
GT |
GT |
DDDM-VC (Ours) |
||
w/o Prior Mixup |
w/o Disentangled Denoiser | |||
w/o Normalized F0 |
w/o Data-driven Prior | |||
GT |
GT |
DDDM-VC (Ours) |
||
w/o Prior Mixup |
w/o Disentangled Denoiser | |||
w/o Normalized F0 |
w/o Data-driven Prior |
Source Speaker | Target Speaker | Pitch Control in VC scenario | |
---|---|---|---|
p225_007 |
p226_007 |
Converted sample #1 |
|
Converted sample #2 | |||
Converted sample #3 |
Source Speaker | Target Speaker | Filter | Source | Method | |
---|---|---|---|---|---|
p232 (VCTK) |
p232 (VCTK) |
Resynthesis | |||
p232 (VCTK) |
p231 (VCTK) |
Voice conversion | |||
Timbre control | |||||
Pitch control |