# Overview

These are diffusion models and noised image classifiers described in the paper [Diffusion Models Beat GANs on Image Synthesis](https://arxiv.org/abs/2105.05233).
Included in this release are the following models:

* Noisy ImageNet classifiers at resolutions 64x64, 128x128, 256x256, and 512x512
* A class-unconditional ImageNet diffusion model at resolution 256x256
* Class-conditional ImageNet diffusion models at resolutions 64x64, 128x128, 256x256, and 512x512
* Class-conditional ImageNet upsampling diffusion models: 64x64->256x256 and 128x128->512x512
* Diffusion models trained on three LSUN classes at 256x256 resolution: cat, horse, and bedroom

# Datasets

All of the models we are releasing were trained either on the [ILSVRC 2012 subset of ImageNet](http://www.image-net.org/challenges/LSVRC/2012/) or on single classes of [LSUN](https://arxiv.org/abs/1506.03365).
Here, we describe characteristics of these datasets that impact model behavior:

**LSUN**: This dataset was collected in 2015 using a combination of human labeling (from Amazon Mechanical Turk) and automated data labeling.
* Each of the three classes we consider contains over a million images.
* The dataset creators found that label accuracy was roughly 90% across the entire LSUN dataset when measured by trained experts.
* Images are scraped from the internet, and LSUN cat images in particular often follow a “meme” format.
* We found that these photos occasionally contain humans, including faces, especially within the cat class.

**ILSVRC 2012 subset of ImageNet**: This dataset was curated in 2012 and consists of roughly one million images, each belonging to one of 1000 classes.
* A large portion of the classes in this dataset are animals, plants, and other naturally occurring objects.
* Many images contain humans, although usually these humans aren’t reflected by the class label (e.g. the class “tench, Tinca tinca” contains many photos of people holding fish).

# Performance

These models are intended to generate samples consistent with their training distributions.
This has been measured in terms of FID, Precision, and Recall.
These metrics all rely on the representations of a [pre-trained Inception-V3 model](https://arxiv.org/abs/1512.00567), which was trained on ImageNet, and so is likely to focus more on the ImageNet classes (such as animals) than on other visual features (such as human faces).
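
For concreteness, here is a minimal sketch of how FID is typically computed from those Inception-V3 representations: fit a Gaussian to the pooled features of real and generated images, then take the Fréchet distance between the two Gaussians. The function below is illustrative and assumes the features have already been extracted; it is not the evaluation code shipped with this repository.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake, eps=1e-6):
    """Frechet distance between Gaussians fit to two sets of
    Inception-V3 pool features, each an array of shape (N, 2048).

    Illustrative sketch only; the official evaluation code may
    differ in numerical details.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the product of the two covariances.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Regularize when the product is near-singular.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm(
            (sigma1 + offset) @ (sigma2 + offset), disp=False
        )

    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean.real))
```
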
Qualitatively, the samples produced by these models often look highly realistic, especially when a diffusion model is combined with a noisy classifier.

# Intended Use

These models are intended to be used for research purposes only.
In particular, they can be used as a baseline for generative modeling research, or as a starting point to build on for such research.

These models are not intended to be commercially deployed.
Additionally, they are not intended to be used to create propaganda or offensive imagery.

Before releasing these models, we probed their ability to ease the creation of targeted imagery, since doing so could be potentially harmful.
We did this either by fine-tuning our ImageNet models on a target LSUN class, or through classifier guidance with publicly available [CLIP models](https://github.com/openai/CLIP).
* To probe fine-tuning capabilities, we restricted our compute budget to roughly $100 and tried both standard fine-tuning and a diffusion-specific approach where we train a specialized classifier for the LSUN class. The resulting FIDs were significantly worse than those of publicly available GAN models, indicating that fine-tuning an ImageNet diffusion model does not significantly lower the cost of image generation.
* To probe guidance with CLIP, we tried two approaches for using pre-trained CLIP models for classifier guidance: either feeding the noised image to CLIP directly and using its gradients, or feeding the diffusion model's denoised prediction to the CLIP model and differentiating through the whole process (the second setup is sketched below). In both cases, we found that it was difficult to recover information from the CLIP model, indicating that these diffusion models are unlikely to make it significantly easier to extract knowledge from CLIP than existing GAN models do.
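
As an illustration of the second CLIP-guidance setup above, here is a minimal, hypothetical sketch: denoise the current sample, score the denoised prediction against a text embedding with CLIP, and differentiate through the whole process. The names `diffusion_model`, `clip_model`, `text_embed`, and `guidance_scale` are placeholders for this sketch, not APIs from this repository or from openai/CLIP.

```python
import torch
import torch.nn.functional as F

def clip_guidance_grad(x_t, t, diffusion_model, clip_model, text_embed,
                       guidance_scale=100.0):
    """Hypothetical sketch of CLIP guidance through the denoised prediction.

    Assumes `diffusion_model(x, t)` returns an estimate of the clean image
    and `clip_model.encode_image` returns image embeddings; both are
    placeholders, not real APIs from this repository.
    """
    x_t = x_t.detach().requires_grad_(True)
    # Denoised prediction of x_0 from the noised sample x_t.
    x0_pred = diffusion_model(x_t, t)
    # Resize to CLIP's input resolution (224x224 for ViT-B/32).
    x0_small = F.interpolate(x0_pred, size=(224, 224),
                             mode="bilinear", align_corners=False)
    image_embed = clip_model.encode_image(x0_small)
    # Cosine similarity between image and text embeddings.
    sim = F.cosine_similarity(image_embed, text_embed, dim=-1).sum()
    # The gradient flows through both CLIP and the diffusion model.
    grad = torch.autograd.grad(sim, x_t)[0]
    return guidance_scale * grad
```

During sampling, such a gradient would be added to the reverse-process mean at each step, analogous to standard classifier guidance.
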
# Limitations

These models sometimes produce highly unrealistic outputs, particularly when generating images containing human faces.
This may stem from ImageNet's emphasis on non-human objects.

While classifier guidance can improve sample quality, it reduces diversity, resulting in some modes of the data distribution being underrepresented.
This can potentially amplify existing biases in the training dataset, such as gender and racial biases.
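
To see where this trade-off comes from, recall the classifier-guided sampling rule from the paper: the reverse-process mean is shifted along the gradient of a classifier's log-probability, with a guidance scale $s$ controlling the strength of the shift:

$$
\tilde{\mu}_\theta(x_t \mid y) = \mu_\theta(x_t) + s \, \Sigma_\theta(x_t) \, \nabla_{x_t} \log p_\phi(y \mid x_t)
$$

Larger values of $s$ concentrate samples on images the classifier finds most typical of class $y$, which improves fidelity at the cost of covering fewer modes.
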
Because ImageNet and LSUN contain images from the internet, they include photos of real people, and the models may have memorized some of the information contained in these photos.
However, these images are already publicly available, and existing generative models trained on ImageNet have not demonstrated significant leakage of this information.