Audio-to-audio translation #14
-
Hi, I'm coming from the computer vision area and trying to build a domain adaptation tool for singing enhancement. I've worked a lot with image-to-image translation using paired examples and am trying to do the same with audio. I have a dataset of paired audio samples and am using the UNet architecture provided to try to generate the translated audio. I'm not familiar with diffusion: does it make sense to incorporate the diffusion loss into my setup, and if so, how would I do it? Currently I'm using a combination of L1 and L2 loss and the WaveGAN discriminator. It sort of learns the translation, but there's a lot of noise. Any advice appreciated. Here's a short sketch of my current training step (simplified; the module names are placeholders):
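import torch
import torch.nn.functional as F

# `generator` is the provided UNet, `discriminator` the WaveGAN discriminator
# (both placeholders here). One generator update on a paired batch:
def generator_step(generator, discriminator, source, target, adv_weight=0.01):
    fake = generator(source)  # translated audio, same shape as target

    # Reconstruction: combined L1 + L2 against the paired target
    recon_loss = F.l1_loss(fake, target) + F.mse_loss(fake, target)

    # Adversarial: push the discriminator's logits on fakes towards "real"
    logits = discriminator(fake)
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return recon_loss + adv_weight * adv_loss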
-
I don't think it would make sense to mix diffusion with GANs. If you need only the UNet1d it makes sense to use it the way you are using it. However, I'm quite confident you could solve this with a pure diffusion-based approach (and get much better results) as follows:

import torch
from audio_diffusion_pytorch import AudioDiffusionModel
model = AudioDiffusionModel(
    in_channels=1,
    context_channels=[1],  # the source audio enters as one extra context channel
)
# Train model with pairs of audio sources, i.e. predict target given source
source = torch.randn(1, 1, 2 ** 18) # [batch, in_channels, samples], 2**18 samples ≈ 12s of audio at a 22050 Hz sample rate
target = torch.randn(1, 1, 2 ** 18)
loss = model(target, channels_list=[source])
loss.backward() # Do this many times
# Sample a target audio given start noise and source audio
source = torch.randn(1, 1, 2 ** 18)
noise = torch.randn(1, 1, 2 ** 18)
sampled = model.sample(
    channels_list=[source],
    noise=noise,
    num_steps=25,  # Suggested range: 2-50
)  # [1, 1, 2 ** 18], the generated target audio
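Note that the random tensors above are only stand-ins: during training, source and target come from your paired dataset, and at sampling time you condition on a real source recording while the noise is drawn fresh, so the sampler denoises it into a prediction of the paired target.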
-
The goal of the original question, mapping one audio file onto another, is the same for me; however, the APIs seem to have changed in the meantime. Is there any way the sample code could be updated to the newer APIs? Cheers
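For reference, a rough sketch against the newer v0.1.x API, based on the README's DiffusionModel example. This is unconditional only: the current replacement for the context_channels / channels_list conditioning used above should be verified against the up-to-date docs.

import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

# Sketch only: constructor arguments follow the v0.1.x README example;
# the paired-conditioning equivalent of `context_channels` is not shown
# here and should be looked up in the current documentation.
model = DiffusionModel(
    net_t=UNetV0,  # 1D UNet backbone
    in_channels=1,  # mono audio
    channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024],  # channels per stage
    factors=[1, 4, 4, 4, 2, 2, 2, 2, 2],  # downsampling factor per stage
    items=[1, 2, 2, 2, 2, 2, 2, 4, 4],  # blocks per stage
    attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1],  # stages with attention
    attention_heads=8,
    attention_features=64,
    diffusion_t=VDiffusion,  # v-objective diffusion
    sampler_t=VSampler,
)

audio = torch.randn(1, 1, 2 ** 18)  # [batch, channels, samples]
loss = model(audio)  # training: returns the diffusion loss
loss.backward()

noise = torch.randn(1, 1, 2 ** 18)
sample = model.sample(noise, num_steps=25)  # denoise into audio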