DDDM-VC Audio Demo

Rebuttal

Sampling rates: DDDM-VC 16,000 Hz / YourTTS 16,000 Hz / NANSY 22,050 Hz / NANSY++ 44,100 Hz
We conducted a zero-shot VC test using the zero-shot samples from the official NANSY and NANSY++ demo pages. NANSY and NANSY++ were trained on data with higher sampling rates, so we downsampled their samples to 16 kHz for comparison.
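As an illustration of the downsampling step mentioned above, here is a minimal Python sketch. The page does not say which resampler was used, so the choice of polyphase resampling via `scipy.signal.resample_poly` is an assumption, and the helper names `resample_ratio` and `downsample` are hypothetical. The sketch reduces the two sampling rates to an integer up/down ratio and passes it to the resampler.

```python
from math import gcd


def resample_ratio(orig_sr, target_sr):
    """Return the (up, down) integer factors that convert orig_sr to target_sr."""
    g = gcd(orig_sr, target_sr)
    return target_sr // g, orig_sr // g


def downsample(samples, orig_sr, target_sr=16000):
    """Resample a 1-D signal to target_sr (assumes SciPy is installed).

    Polyphase resampling is an assumed choice; it applies an
    anti-aliasing low-pass filter as part of the rate change.
    """
    from scipy.signal import resample_poly  # third-party dependency

    up, down = resample_ratio(orig_sr, target_sr)
    return resample_poly(samples, up, down)


# Integer ratios for the two higher-rate systems listed above.
nansy_ratio = resample_ratio(22050, 16000)     # NANSY
nansypp_ratio = resample_ratio(44100, 16000)   # NANSY++
```

Under these assumptions, a NANSY++ sample loaded as a 1-D array `x` at 44,100 Hz would be converted with `downsample(x, 44100)`.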

Source Speaker	Target Speaker	Converted
male 1089_134691_000053_000001	male 2830_3979_000018_000009	DDDM-VC 16,000 Hz	YourTTS 16,000 Hz	NANSY 22,050 Hz
male 1089_134686_000002_000000	female 121_121726_000007_000003	DDDM-VC	YourTTS	NANSY
female 3570_5695_000004_000001	female 1284_1180_000008_000001	DDDM-VC	YourTTS	NANSY
female 5142_36600_000010_000004	male 7176_88083_000006_000004	DDDM-VC	YourTTS	NANSY

Source Speaker	Target Speaker	Converted
female p225	female p335	DDDM-VC 16,000 Hz	YourTTS 16,000 Hz	NANSY++ 44,100 Hz
male p347	male p326	DDDM-VC	YourTTS	NANSY++

One-shot Speaker Adaptation

Real-world Dataset

Source Speaker	Target Speaker	Converted
Emma Watson (03:30 ~ 03:40)	Gollum (00:30 ~ 00:40)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)

Source Speaker	Target Speaker	Converted
Glados (00:00 ~ 00:10)	Benedict Cumberbatch (05:00 ~ 05:10)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Tom Holland (00:45 ~ 00:55)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)

Source Speaker	Target Speaker	Converted
226_007 (VCTK)	Benedict Cumberbatch (05:00 ~ 05:10)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Emma Watson (03:30 ~ 03:40)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Tom Holland (00:45 ~ 00:55)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Gollum (00:30 ~ 00:40)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Glados (00:00 ~ 00:10)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Heung-min Son (00:04 ~ 00:14)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)
	Steve Jobs (00:55 ~ 01:05)	DDDM-VC-Fine-tuning (One-shot, 30 iter.)	DDDM-VC (Zero-shot, 30 iter.)

Many-to-Many Voice Conversion (LibriTTS)

All speakers are seen during training

Source Speaker	Target Speaker	Converted
GT (1571)	GT (3526)	AutoVC	VoiceMixer	Speech Resynthesis
		DiffVC (6 iter.)	DiffVC (30 iter.)
		DDDM-VC (Small, 6 iter.)	DDDM-VC (Small, 30 iter.)
		DDDM-VC (Base, 6 iter.)	DDDM-VC (Base, 30 iter.)
GT (3699)	GT (374)	AutoVC	VoiceMixer	Speech Resynthesis
		DiffVC (6 iter.)	DiffVC (30 iter.)
		DDDM-VC (Small, 6 iter.)	DDDM-VC (Small, 30 iter.)
		DDDM-VC (Base, 6 iter.)	DDDM-VC (Base, 30 iter.)

Zero-shot Voice Conversion (VCTK)

All speakers are unseen during training

Source Speaker	Target Speaker	Converted
GT p227 (male)	GT p229 (female)	AutoVC	VoiceMixer	Speech Resynthesis
		DiffVC (6 iter.)	DiffVC (30 iter.)
		DDDM-VC (Small, 6 iter.)	DDDM-VC (Small, 30 iter.)
		DDDM-VC (Base, 6 iter.)	DDDM-VC (Base, 30 iter.)
		DDDM-VC (Fine-tuning, 6 iter.)	DDDM-VC (Fine-tuning, 30 iter.)
GT p236 (female)	GT p226 (male)	AutoVC	VoiceMixer	Speech Resynthesis
		DiffVC (6 iter.)	DiffVC (30 iter.)
		DDDM-VC (Small, 6 iter.)	DDDM-VC (Small, 30 iter.)
		DDDM-VC (Base, 6 iter.)	DDDM-VC (Base, 30 iter.)
		DDDM-VC (Fine-tuning, 6 iter.)	DDDM-VC (Fine-tuning, 30 iter.)

Zero-shot Cross-lingual Voice Conversion

Unseen languages and unseen speakers from the CSS10 multilingual dataset

Source Speaker	Target Speaker	Converted
French	Hungarian	DDDM-VC
French	Greek	DDDM-VC

Source Speaker	Target Speaker	Converted
Finnish	Dutch	DDDM-VC
Finnish	Russian	DDDM-VC

Source Speaker	Target Speaker	Converted
Russian	Dutch	DDDM-VC
Russian	French	DDDM-VC

Source Speaker	Target Speaker	Converted
Spanish	French	DDDM-VC
Spanish	Russian	DDDM-VC

Source Speaker	Target Speaker	Converted
German	French	DDDM-VC
German	Dutch	DDDM-VC

DDDM-TTS

Multi-speaker TTS

Sentence 1 According to the count's directions, Danglars was waited on by Vampa, who brought him the best wine and fruits of Italy; then, having conducted him to the road, and pointed to the post chaise, left him leaning against a tree.
GT	VITS	HierSpeech	DDDM-TTS (Ours)

Sentence 2 Deprived of the objects of both intellect and emotion, he could not proceed to his work.
GT	VITS	HierSpeech	DDDM-TTS (Ours)

Sentence 3 I doubt not but in acknowledgment you will make your deliverer your wife, as I have promised." He joyfully consented; but before they married, she changed my wife into a hind; and this is she whom you see here.
GT	VITS	HierSpeech	DDDM-TTS (Ours)

Sentence 4 Suddenly Graham's knees bent beneath him, his arm against the pillar collapsed limply, he staggered forward and fell upon his face.
GT	VITS	HierSpeech	DDDM-TTS (Ours)

Sentence 5 In modern times the sultans or rulers of Turkey have been commonly regarded as the caliphs.
GT	VITS	HierSpeech	DDDM-TTS (Ours)

Sentence 6 (Zero-shot) When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
GT	DDDM-TTS (Ours)

Sentence 7 (Zero-shot) These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.
GT	DDDM-TTS (Ours)

DDDM-Mixer

Audio Mixing
Mix synthesized sound and synthesized speech, for which no original audio exists, in the desired ratio. Sound audio was generated using AudioLDM [H. Liu et al., 2023], and speech audio was generated using our DDDM-VC.

Synthesized Sound Audio	Synthesized Speech Audio	Mixed Audio (Ours)
Techno music with a strong, upbeat tempo and high melodic riffs	Converted voice (Female)
A capella	Converted voice (Male)
Chopping potatoes on a metal table.	Converted voice (Gollum)
The sound of a steam engine.	Converted voice (Male)
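
To make the idea of mixing "in the desired ratio" concrete, the following Python sketch blends two waveforms with a linear amplitude weight. This is only a waveform-level illustration under stated assumptions, not the DDDM-Mixer itself: the page does not describe how the ratio is applied, and the function name `mix_at_ratio` and the parameter `sound_weight` are hypothetical.

```python
def mix_at_ratio(sound, speech, sound_weight=0.5):
    """Blend two sample sequences with a linear amplitude weight.

    sound_weight is the share given to the sound track (assumed to be
    in [0, 1]); the speech track receives the remainder. The output is
    truncated to the shorter of the two inputs.
    """
    if not 0.0 <= sound_weight <= 1.0:
        raise ValueError("sound_weight must be between 0 and 1")
    n = min(len(sound), len(speech))
    speech_weight = 1.0 - sound_weight
    return [sound_weight * sound[i] + speech_weight * speech[i] for i in range(n)]


# Toy example: 25% sound, 75% speech.
mixed = mix_at_ratio([1.0, 0.0, 2.0], [0.0, 1.0], sound_weight=0.25)
```

Because the weights sum to one, the mix stays within the amplitude range of the inputs when both are already normalized.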

Ablation study

Many-to-many VC tasks with seen speakers from the LibriTTS dataset

Source Speaker	Target Speaker	Converted
GT 3699 (male)	GT 3526 (female)	DDDM-VC (Ours)
		w/o Prior Mixup	w/o Disentangled Denoiser
		w/o Normalized F0	w/o Data-driven Prior
GT 1603 (male)	GT 1571 (male)	DDDM-VC (Ours)
		w/o Prior Mixup	w/o Disentangled Denoiser
		w/o Normalized F0	w/o Data-driven Prior
GT 3440 (female)	GT 1639 (male)	DDDM-VC (Ours)
		w/o Prior Mixup	w/o Disentangled Denoiser
		w/o Normalized F0	w/o Data-driven Prior

Style Control

Source Speaker	Target Speaker	Pitch Control in VC scenario
p225_007	p226_007	Converted sample #1
		Converted sample #2
		Converted sample #3

Source Speaker	Target Speaker	Filter	Source	Method
p232 (VCTK)	p232 (VCTK)	Source spk emb	Source spk emb	Resynthesis
p232 (VCTK)	p231 (VCTK)	Target spk emb	Target spk emb	Voice conversion
		Target spk emb	Source spk emb	Timbre control
		Source spk emb	Target spk emb	Pitch control