Audio-to-audio translation #14
-
Hi, I'm coming from the computer vision area and trying to build a domain adaptation tool for singing enhancement. I've worked a lot with image-to-image translation using paired examples and am trying to do the same with audio. I have a dataset of paired audio samples and am using the UNet architecture provided to try to generate the translated audio. I'm not familiar with diffusion: does it make sense to incorporate the diffusion loss into my setup, and if so, how would I do it? Currently I'm using a combination of L1 and L2 loss and the WaveGAN discriminator. It sort of learns the translation, but there's a lot of noise. Any advice appreciated. Here's a short sketch of my current training step (simplified; the module names are placeholders):
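import torch
import torch.nn.functional as F

# `generator` is the provided UNet, `discriminator` the WaveGAN discriminator
# (both placeholders here). One generator update on a paired batch:
def generator_step(generator, discriminator, source, target, adv_weight=0.01):
    fake = generator(source)  # translated audio, same shape as target

    # Reconstruction: combined L1 + L2 against the paired target
    recon_loss = F.l1_loss(fake, target) + F.mse_loss(fake, target)

    # Adversarial: push the discriminator's logits on fakes towards "real"
    logits = discriminator(fake)
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return recon_loss + adv_weight * adv_loss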
-
I don't think it would make sense to mix diffusion with GANs. If you need only the UNet1d it makes sense to use it the way you are using it. However, I'm quite confident you could solve this with a pure diffusion-based approach (and get much better results) as follows:

import torch
from audio_diffusion_pytorch import AudioDiffusionModel
model = AudioDiffusionModel(
    in_channels=1,
    context_channels=[1],  # the source audio enters as one extra context channel
)
# Train model with pairs of audio sources, i.e. predict target given source
source = torch.randn(1, 1, 2 ** 18) # [batch, in_channels, samples], 2**18 samples ≈ 12s of audio at a 22050 Hz sample rate
target = torch.randn(1, 1, 2 ** 18)
loss = model(target, channels_list=[source])
loss.backward() # Do this many times
# Sample a target audio given start noise and source audio
source = torch.randn(1, 1, 2 ** 18)
noise = torch.randn(1, 1, 2 ** 18)
sampled = model.sample(
    channels_list=[source],
    noise=noise,
    num_steps=25,  # Suggested range: 2-50
)  # [1, 1, 2 ** 18], the generated target audio
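Note that the random tensors above are only stand-ins: during training, source and target come from your paired dataset, and at sampling time you condition on a real source recording while the noise is drawn fresh, so the sampler denoises it into a prediction of the paired target.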
-
The goal of the original question, mapping one audio file onto another, is the same for me; however, the APIs seem to have changed in the meantime. Is there any way the sample code could be updated to the newer APIs? Cheers
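For reference, a rough sketch against the newer v0.1.x API, based on the README's DiffusionModel example. This is unconditional only: the current replacement for the context_channels / channels_list conditioning used above should be verified against the up-to-date docs.

import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

# Sketch only: constructor arguments follow the v0.1.x README example;
# the paired-conditioning equivalent of `context_channels` is not shown
# here and should be looked up in the current documentation.
model = DiffusionModel(
    net_t=UNetV0,  # 1D UNet backbone
    in_channels=1,  # mono audio
    channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024],  # channels per stage
    factors=[1, 4, 4, 4, 2, 2, 2, 2, 2],  # downsampling factor per stage
    items=[1, 2, 2, 2, 2, 2, 2, 4, 4],  # blocks per stage
    attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1],  # stages with attention
    attention_heads=8,
    attention_features=64,
    diffusion_t=VDiffusion,  # v-objective diffusion
    sampler_t=VSampler,
)

audio = torch.randn(1, 1, 2 ** 18)  # [batch, channels, samples]
loss = model(audio)  # training: returns the diffusion loss
loss.backward()

noise = torch.randn(1, 1, 2 ** 18)
sample = model.sample(noise, num_steps=25)  # denoise into audio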