Replies: 10 comments 17 replies
-
I'm not an expert on this, and I think it would be hard to tell without running experiments. If I had to guess (seeing how similar models are scaled up for image generation), I would increase the number of resnet blocks.
-
Thanks! I'll try those settings. So you'd leave the attention features/heads/etc. the same?
-
Heads definitely leave the same; the attention features you could increase. Btw, let me know if you get good results and which settings you end up using :)
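To make the "which knobs to scale" discussion a bit more concrete, here is a rough sketch of the idea. The parameter names are only illustrative and may not match the exact kwargs of the audio-diffusion-pytorch version you have installed, so check the UNet signature in your release:

```python
# Hypothetical baseline, roughly in the shape of a 1D diffusion UNet config.
baseline = dict(
    channels=128,                  # base channel width
    multipliers=[1, 2, 4, 4, 4],   # per-resolution channel multipliers
    num_blocks=[2, 2, 2, 2],       # resnet blocks per resolution
    attention_heads=8,             # kept the same, per the advice above
    attention_features=64,         # the knob you could increase
)

# Scaling the resnet depth first, keeping the attention heads fixed:
scaled = dict(
    baseline,
    num_blocks=[b * 2 for b in baseline["num_blocks"]],  # e.g. 2 -> 4 per stage
    attention_features=128,                              # optional: wider attention
)
```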
-
I stuck with the numbers you described in your first comment and left everything else the same. I also ran into a couple of issues with my PC overheating (really hot this weekend!) and doubled the dataset size halfway through, which explains the weird loss curves. Additionally, this doesn't include your more recent context_channels commit. Do you think it's worth resetting with the increased attention and context channels? Or seeing this one further through? Also, does this library use the VAE encoding trick Stable Diffusion uses to increase efficiency?
-
Thanks for sharing! The context_channels commit is for some experiments I'm doing with conditioning, so it's not necessary for unconditional generation. It's hard to tell what's worth trying; I would wait for this experiment to be done and maybe run another where you only change the attention size, to compare which is more influential. I tried the VAE to increase efficiency, but it's very hard to train a good VAE: there's no good loss function for audio, and it's also hard to make diffusion work with that. I would leave that out for now if you don't want to do lots of experiments :)
-
Could one just steal the pretrained VQ-VAEs from OpenAI's Jukebox? Or is that type not useful for efficiency improvements like that of Stable Diffusion?
-
I'm not sure that would work, since in order to add noise to the encoded input it needs to be in the range [-1, 1] with a mean of 0. Maybe if properly regularized.
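For what it's worth, here is a minimal sketch of that "properly regularized" idea, assuming some pretrained encoder (e.g. a Jukebox VQ-VAE level) wrapped in a hypothetical `encode_batch` function: estimate latent statistics over part of the dataset, then standardize before diffusing.

```python
import torch

@torch.no_grad()
def latent_stats(encode_batch, loader, max_batches=100):
    # Estimate per-channel mean/std of the encoder's latents over a data sample.
    feats = []
    for i, audio in enumerate(loader):
        if i >= max_batches:
            break
        feats.append(encode_batch(audio).float())  # (batch, channels, time)
    z = torch.cat(feats, dim=0)
    return z.mean(dim=(0, 2), keepdim=True), z.std(dim=(0, 2), keepdim=True)

def normalize(z, mean, std):
    # Zero-mean, roughly unit-scale latents; invert this before decoding.
    return (z - mean) / (std + 1e-5)
```

Whether latents treated this way behave well enough under diffusion noise is exactly the open question above.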
-
That makes sense. Is there any reason a pyramid of diffusers à la Jukebox's transformer priors couldn't do a similar job? That was my plan once I got something resembling acceptable results out of this level. Also, is there a rule of thumb for when to end training? Or do people just wait until changes are no longer audible/visible?
-
I switched to the larger attention features version and am getting slightly more encouraging results: https://wandb.ai/zaptrem/diffusion-pop-4?workspace=user-zaptrem I think I should keep scaling. Is the learning rate falloff determined by the number of epochs, or by steps?
-
When you say a pyramid of diffusers do you mean like: a first diffusion model predicting a source at 12kHz, then a second upsampling that to 24kHz, and a third to 48kHz?
There isn't. I've noticed that sometimes, even if the loss seems to converge, the quality continues to improve a bit after that. It's hard to find a rule that always applies, since there's no good metric for audio quality.
That's very interesting! (For some reason, the provided link seems to be dead)
I didn't add any LR scheduler, but I think other people use InverseLR, CosineAnnealingLR, or LambdaLR scheduling. Also, ideally you would keep a second copy of the model with EMA weights from which you do the sampling, so that it's more stable; see for example the trainer in imagen-pytorch. It's something I might add to the trainer in the future. Btw, I'm going to move this issue into the general discussion :)
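A minimal sketch of the EMA + step-based schedule idea in plain PyTorch (the `model`, `optimizer`, `train_step`, and `total_steps` names are assumed to exist in your training loop; this is not the library's built-in trainer):

```python
import copy
import torch

# Shadow copy of the model; sample from this one instead of the training model.
ema_model = copy.deepcopy(model).eval().requires_grad_(False)

@torch.no_grad()
def update_ema(ema, online, decay=0.999):
    for p_ema, p in zip(ema.parameters(), online.parameters()):
        p_ema.lerp_(p, 1.0 - decay)  # p_ema = decay * p_ema + (1 - decay) * p

# Cosine annealing tied to optimizer steps (not epochs), one common choice:
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    loss = train_step()   # forward/backward pass
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()      # stepping here makes the LR falloff depend on steps
    update_ema(ema_model, model)
```

Because `scheduler.step()` is called once per optimizer step here, the falloff is a function of steps rather than epochs; calling it once per epoch instead would tie it to epochs.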
-
Hello! I'm in the process of training a model on a top-40s dataset using your library. However, I want to experiment with long-term consistency, so I've scaled sample rate/channels accordingly to fit ~90s windows during training. I think my results could be improved by further scaling up the number of model parameters, but I'm not sure what to change and by what ratios to get the most bang for my buck/VRAM/compute. Do you guys have a "scaled" config you could share or a general guide (e.g., 2X attention heads, 1.5X mults) for this? Thanks!
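For context on how large those windows are, a quick back-of-envelope (the sample rate, patch size, and downsampling factors below are illustrative, not the config actually used in this thread):

```python
sample_rate = 48_000
window_seconds = 90
samples = sample_rate * window_seconds       # 4,320,000 samples per training example

# In a 1D UNet, the product of the per-stage downsampling factors (times any
# initial patching) sets the length of the coarsest feature map:
patch_size = 16
factors = [4, 4, 4, 2, 2, 2]
total_downsample = patch_size
for f in factors:
    total_downsample *= f                    # 16 * 512 = 8192 here
print(samples, samples // total_downsample)  # ~527 bottleneck frames in this example
```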