Replies: 10 comments 17 replies
-
I'm not an expert on this, and I think it would be hard to tell without running experiments. If I had to guess (seeing how similar models are scaled up for image generation), I would increase the number of resnet blocks.
-
Thanks! I'll try those settings. So you'd leave the attention features/heads/etc. the same?
-
Heads definitely leave the same; the attention features you could increase. Btw, let me know if you get good results and which settings you end up using :)
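To make the "which knobs to scale" discussion a bit more concrete, here is a rough sketch of the idea. The parameter names are only illustrative and may not match the exact kwargs of the audio-diffusion-pytorch version you have installed, so check the UNet signature in your release:

```python
# Hypothetical baseline, roughly in the shape of a 1D diffusion UNet config.
baseline = dict(
    channels=128,                  # base channel width
    multipliers=[1, 2, 4, 4, 4],   # per-resolution channel multipliers
    num_blocks=[2, 2, 2, 2],       # resnet blocks per resolution
    attention_heads=8,             # kept the same, per the advice above
    attention_features=64,         # the knob you could increase
)

# Scaling the resnet depth first, keeping the attention heads fixed:
scaled = dict(
    baseline,
    num_blocks=[b * 2 for b in baseline["num_blocks"]],  # e.g. 2 -> 4 per stage
    attention_features=128,                              # optional: wider attention
)
```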
-
I stuck with the numbers you described in your first comment and left everything else the same. I also ran into a couple of issues with my PC overheating (really hot this weekend!) and doubled the dataset size halfway through, which explains the weird loss curves. Additionally, this doesn't include your more recent context_channels commit. Do you think it's worth resetting with the increased attention and context channels? Or seeing this one further through? Also, does this library use the VAE encoding trick Stable Diffusion uses to increase efficiency?
-
Thanks for sharing! The context_channels commit is for some experiments I'm doing with conditioning, so it's not necessary for unconditional generation. It's hard to tell what's worth trying; I would wait for this experiment to be done and maybe run another where you only change the attention size, to compare which is more influential. I tried the VAE to increase efficiency, but it's very hard to train a good VAE: there's no good loss function for audio, and it's also hard to make diffusion work with that. I would leave that out for now if you don't want to do lots of experiments :)
-
Could one just steal the pretrained VQ-VAEs from OpenAI's Jukebox? Or is that type not useful for efficiency improvements like that of Stable Diffusion?
-
I'm not sure that would work, since in order to add noise to the encoded input it needs to be in the range [-1, 1] with a mean of 0. Maybe if properly regularized.
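For what it's worth, here is a minimal sketch of that "properly regularized" idea, assuming some pretrained encoder (e.g. a Jukebox VQ-VAE level) wrapped in a hypothetical `encode_batch` function: estimate latent statistics over part of the dataset, then standardize before diffusing.

```python
import torch

@torch.no_grad()
def latent_stats(encode_batch, loader, max_batches=100):
    # Estimate per-channel mean/std of the encoder's latents over a data sample.
    feats = []
    for i, audio in enumerate(loader):
        if i >= max_batches:
            break
        feats.append(encode_batch(audio).float())  # (batch, channels, time)
    z = torch.cat(feats, dim=0)
    return z.mean(dim=(0, 2), keepdim=True), z.std(dim=(0, 2), keepdim=True)

def normalize(z, mean, std):
    # Zero-mean, roughly unit-scale latents; invert this before decoding.
    return (z - mean) / (std + 1e-5)
```

Whether latents treated this way behave well enough under diffusion noise is exactly the open question above.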
-
That makes sense. Is there any reason a pyramid of diffusers à la Jukebox's transformer priors couldn't do a similar job? That was my plan once I got something resembling acceptable results out of this level. Also, is there a rule of thumb for when to end training? Or do people just wait until changes are no longer audible/visible?
-
I switched to the larger attention features version and am getting slightly more encouraging results: https://wandb.ai/zaptrem/diffusion-pop-4?workspace=user-zaptrem I think I should keep scaling. Is the learning rate falloff determined by the number of epochs, or by steps?
-
When you say a pyramid of diffusers do you mean like: a first diffusion model predicting a source at 12kHz, then a second upsampling that to 24kHz, and a third to 48kHz?
There isn't. I've noticed that sometimes, even if the loss seems to converge, the quality continues to improve a bit after that. It's hard to find a rule that always applies, since there's no good metric for audio quality.
That's very interesting! (For some reason, the provided link seems to be dead)
I didn't add any LR scheduler, but I think other people use InverseLR, CosineAnnealingLR, or LambdaLR scheduling. Also, ideally you would keep a second copy of the model with EMA weights from which you do the sampling, so that it's more stable; see for example the trainer in imagen-pytorch. It's something I might add to the trainer in the future. Btw, I'm going to move this issue into the general discussion :)
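A minimal sketch of the EMA + step-based schedule idea in plain PyTorch (the `model`, `optimizer`, `train_step`, and `total_steps` names are assumed to exist in your training loop; this is not the library's built-in trainer):

```python
import copy
import torch

# Shadow copy of the model; sample from this one instead of the training model.
ema_model = copy.deepcopy(model).eval().requires_grad_(False)

@torch.no_grad()
def update_ema(ema, online, decay=0.999):
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.lerp_(p, 1.0 - decay)  # p_ema = decay * p_ema + (1 - decay) * p

# Cosine annealing tied to optimizer steps (not epochs), one common choice:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    loss = train_step()   # forward/backward pass
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()      # stepping here makes the LR falloff depend on steps
    update_ema(ema_model, model)
```

Because `scheduler.step()` is called once per optimizer step here, the falloff is a function of steps rather than epochs; calling it once per epoch instead would tie it to epochs.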
-
Hello! I'm in the process of training a model on a top-40s dataset using your library. However, I want to experiment with long-term consistency, so I've scaled sample rate/channels accordingly to fit ~90s windows during training. I think my results could be improved by further scaling up the number of model parameters, but I'm not sure what to change and by what ratios to get the most bang for my buck/VRAM/compute. Do you guys have a "scaled" config you could share or a general guide (e.g., 2X attention heads, 1.5X mults) for this? Thanks!
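For context on how large those windows are, a quick back-of-envelope (the sample rate, patch size, and downsampling factors below are illustrative, not the config actually used in this thread):

```python
sample_rate = 48_000
window_seconds = 90
samples = sample_rate * window_seconds       # 4,320,000 samples per training example

# In a 1D UNet, the product of the per-stage downsampling factors (times any
# initial patching) sets the length of the coarsest feature map:
patch_size = 16
factors = [4, 4, 4, 2, 2, 2]
total_downsample = patch_size
for f in factors:
    total_downsample *= f                    # 16 * 512 = 8192 here
print(samples, samples // total_downsample)  # ~527 bottleneck frames in this example
```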