DDDM-VC Audio Demo

DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion

 

Overall framework of DDDM-VC


Rebuttal

DDDM-VC: 16,000 Hz / YourTTS: 16,000 Hz / NANSY: 22,050 Hz / NANSY++: 44,100 Hz
We conducted a zero-shot VC test using the zero-shot samples from the official NANSY and NANSY++ demo pages. NANSY and NANSY++ were trained on data with higher sampling rates, so for comparison we downsampled their samples to 16 kHz.
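The downsampling to 16 kHz mentioned above could be done along the lines of the following minimal sketch. It assumes SciPy's polyphase resampler; the page does not state which tool was actually used, and `downsample_to` is a hypothetical helper name introduced only for illustration.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def downsample_to(wave, orig_sr, target_sr=16000):
    """Resample a mono waveform from orig_sr to target_sr (hypothetical helper).

    resample_poly applies an anti-aliasing low-pass filter internally,
    so no separate filtering step is needed before decimation.
    """
    g = gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g  # e.g. 22050 -> 16000 gives 320/441
    return resample_poly(np.asarray(wave, dtype=np.float64), up, down)


# One second of a 440 Hz tone at each of the higher sampling rates on this page.
tone_22k = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
tone_44k = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)

nansy_16k = downsample_to(tone_22k, 22050)
nansypp_16k = downsample_to(tone_44k, 44100)
```

Both one-second inputs come out as 16,000 samples, so they can be compared directly with the 16 kHz outputs of the other models.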

Source Speaker Target Speaker Converted

male
1089_134691_000053_000001

male
2830_3979_000018_000009

DDDM-VC
16,000 Hz

YourTTS
16,000 Hz

NANSY
22,050 Hz

male
1089_134686_000002_000000

female
121_121726_000007_000003

DDDM-VC

YourTTS

NANSY

female
3570_5695_000004_000001

female
1284_1180_000008_000001

DDDM-VC

YourTTS

NANSY

female
5142_36600_000010_000004

male
7176_88083_000006_000004

DDDM-VC

YourTTS

NANSY

Source Speaker Target Speaker Converted

female
p225

female
p335

DDDM-VC
16,000 Hz

YourTTS
16,000 Hz

NANSY++
44,100 Hz

male
p347

male
p326

DDDM-VC

YourTTS

NANSY++

One-shot Speaker Adaptation

Real-world Dataset

Source Speaker Target Speaker Converted

Emma Watson
(03:30 ~ 03:40)

Gollum
(00:30 ~ 00:40)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Source Speaker Target Speaker Converted

Glados
(00:00 ~ 00:10)

Benedict Cumberbatch
(05:00 ~ 05:10)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Tom Holland
(00:45 ~ 00:55)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Source Speaker Target Speaker Converted

226_007
(VCTK)

Benedict Cumberbatch
(05:00 ~ 05:10)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Emma Watson
(03:30 ~ 03:40)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Tom Holland
(00:45 ~ 00:55)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Gollum
(00:30 ~ 00:40)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Glados
(00:00 ~ 00:10)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Heung-min Son
(00:04 ~ 00:14)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Steve Jobs
(00:55 ~ 01:05)

DDDM-VC-Fine-tuning
(One-shot, 30 iter.)

DDDM-VC
(Zero-shot, 30 iter.)

Many-to-Many Voice Conversion (LibriTTS)

All speakers are seen during training

Source Speaker Target Speaker Converted

GT (1571)

GT (3526)

AutoVC

VoiceMixer

Speech Resynthesis

DiffVC-6

DiffVC-30

DDDM-VC
(Small, 6 iter.)

DDDM-VC
(Small, 30 iter.)

DDDM-VC
(Base, 6 iter.)

DDDM-VC
(Base, 30 iter.)

GT (3699)

GT (374)

AutoVC

VoiceMixer

Speech Resynthesis

DiffVC-6

DiffVC-30

DDDM-VC
(Small, 6 iter.)

DDDM-VC
(Small, 30 iter.)

DDDM-VC
(Base, 6 iter.)

DDDM-VC
(Base, 30 iter.)

Zero-shot Voice Conversion (VCTK)

All speakers are unseen during training

Source Speaker Target Speaker Converted

GT
p227 (male)

GT
p229 (female)

AutoVC

VoiceMixer

Speech Resynthesis

DiffVC
(6 iter.)

DiffVC
(30 iter.)

DDDM-VC
(Small, 6 iter.)

DDDM-VC
(Small, 30 iter.)

DDDM-VC
(Base, 6 iter.)

DDDM-VC
(Base, 30 iter.)

DDDM-VC
(Fine-tuning, 6 iter.)

DDDM-VC
(Fine-tuning, 30 iter.)

GT
p236 (female)

GT
p226 (male)

AutoVC

VoiceMixer

Speech Resynthesis

DiffVC
(6 iter.)

DiffVC
(30 iter.)

DDDM-VC
(Small, 6 iter.)

DDDM-VC
(Small, 30 iter.)

DDDM-VC
(Base, 6 iter.)

DDDM-VC
(Base, 30 iter.)

DDDM-VC
(Fine-tuning, 6 iter.)

DDDM-VC
(Fine-tuning, 30 iter.)

Zero-shot Cross-lingual Voice Conversion

Unseen languages and unseen speakers from the CSS10 multilingual dataset

Source Speaker Target Speaker Converted

French

Hungarian

DDDM-VC

French

Greek

DDDM-VC

Source Speaker Target Speaker Converted

Finnish

Dutch

DDDM-VC

Finnish

Russian

DDDM-VC

Source Speaker Target Speaker Converted

Russian

Dutch

DDDM-VC

Russian

French

DDDM-VC

Source Speaker Target Speaker Converted

Spanish

French

DDDM-VC

Spanish

Russian

DDDM-VC

Source Speaker Target Speaker Converted

German

French

DDDM-VC

German

Dutch

DDDM-VC

DDDM-TTS

Multi-speaker TTS

Sentence 1

According to the count's directions, Danglars was waited on by Vampa, who brought him the best wine and fruits of Italy; then, having conducted him to the road, and pointed to the post chaise, left him leaning against a tree.

GT

VITS

HierSpeech

DDDM-TTS (Ours)

Sentence 2

Deprived of the objects of both intellect and emotion, he could not proceed to his work.

GT

VITS

HierSpeech

DDDM-TTS (Ours)

Sentence 3

I doubt not but in acknowledgment you will make your deliverer your wife, as I have promised." He joyfully consented; but before they married, she changed my wife into a hind; and this is she whom you see here.

GT

VITS

HierSpeech

DDDM-TTS (Ours)

Sentence 4

Suddenly Graham's knees bent beneath him, his arm against the pillar collapsed limply, he staggered forward and fell upon his face.

GT

VITS

HierSpeech

DDDM-TTS (Ours)

Sentence 5

In modern times the sultans or rulers of Turkey have been commonly regarded as the caliphs.

GT

VITS

HierSpeech

DDDM-TTS (Ours)

Sentence 6 (Zero-shot)

When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

GT

DDDM-TTS (Ours)

Sentence 7 (Zero-shot)

These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.

GT

DDDM-TTS (Ours)

DDDM-Mixer

Audio Mixing
Synthesized sound and speech are mixed at a desired ratio for cases where no original audio exists. The sound audio was generated using AudioLDM [H. Liu et al., 2023], and the speech audio was generated using our DDDM-VC.
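To illustrate what mixing two signals at a desired ratio can mean in practice, here is a minimal NumPy sketch. It is not the DDDM-Mixer model itself: it assumes the ratio is a speech-to-sound RMS level in decibels, and `mix_at_ratio` and its parameters are hypothetical names introduced for this example.

```python
import numpy as np


def mix_at_ratio(speech, sound, ratio_db):
    """Mix two mono signals so that speech RMS / sound RMS equals ratio_db.

    Illustrative only: a plain level-based mix, not the DDDM-Mixer method.
    Returns the mixed signal and the gain applied to the sound track.
    """
    n = min(len(speech), len(sound))  # truncate to the shorter signal
    speech = np.asarray(speech[:n], dtype=np.float64)
    sound = np.asarray(sound[:n], dtype=np.float64)

    rms_speech = np.sqrt(np.mean(speech ** 2))
    rms_sound = np.sqrt(np.mean(sound ** 2))
    if rms_speech == 0 or rms_sound == 0:
        raise ValueError("both signals must be non-silent")

    # Scale the sound so the level difference matches the requested ratio.
    gain = rms_speech / (rms_sound * 10 ** (ratio_db / 20))
    mixed = speech + gain * sound

    # Rescale only if the sum would clip outside [-1, 1].
    peak = np.max(np.abs(mixed))
    if peak > 1.0:
        mixed = mixed / peak
    return mixed, gain


# Stand-in signals: a tone for the speech and seeded noise for the sound.
rng = np.random.default_rng(0)
speech = 0.5 * np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
sound = 0.1 * rng.standard_normal(20000)
mixed, gain = mix_at_ratio(speech, sound, ratio_db=6.0)
```

A higher `ratio_db` keeps the speech more prominent, and a negative value lets the background sound dominate.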

Synthesized Sound Audio Synthesized Speech Audio Mixed Audio (Ours)

Techno music with a strong, upbeat tempo and high melodic riffs

Converted voice
(Female)


A capella

Converted voice
(Male)


Chopping potatoes on a metal table.

Converted voice
(Gollum)


The sound of a steam engine.

Converted voice
(Male)


Ablation study

Many-to-many VC with seen speakers from the LibriTTS dataset

Source Speaker Target Speaker Converted

GT
3699 (male)

GT
3526 (female)

DDDM-VC
(Ours)

w/o Prior Mixup

w/o Disentangled Denoiser

w/o Normalized F0

w/o Data-driven Prior

GT
1603 (male)

GT
1571 (male)

DDDM-VC
(Ours)

w/o Prior Mixup

w/o Disentangled Denoiser

w/o Normalized F0

w/o Data-driven Prior

GT
3440 (female)

GT
1639 (male)

DDDM-VC
(Ours)

w/o Prior Mixup

w/o Disentangled Denoiser

w/o Normalized F0

w/o Data-driven Prior

Style Control

Source Speaker Target Speaker Pitch Control in VC scenario

p225_007

p226_007

Converted sample #1

Converted sample #2

Converted sample #3

Source Speaker Target Speaker Filter Source Method

p232 (VCTK)

p232 (VCTK)

Source spk emb

Source spk emb

Resynthesis

p232 (VCTK)

p231 (VCTK)

Target spk emb

Target spk emb

Voice conversion

Target spk emb

Source spk emb

Timbre control

Source spk emb

Target spk emb

Pitch control