- TensorFlow (1.12)
- Numpy
- Matplotlib
- Pillow (PIL)
- sklearn
Fashion-MNIST is a dataset of Zalando's article images, consisting of 70,000 grayscale images in 10 categories. Each example is a 28x28 grayscale image associated with a label from one of the 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset, which is often used as the "Hello, World" of machine learning programs for computer vision. It shares the same image size and the same structure of training and testing splits. We will use 60,000 images to train the network and 10,000 images to evaluate how accurately the network learned to classify images.
To get started, we provide two sample code files to get you familiar with TensorFlow and its high-level API (tf.keras). The first example script (00_fashion_mnist.py) uses a static graph to build, train, and test the model. The second example script (01_fashion_mnist.py) uses TensorFlow's eager execution mode, which builds the graph dynamically. We also recommend that you go through the Keras and eager execution tutorials.
Try running both scripts and see the difference between the two modes. Each script prints the loss and accuracy as it trains. Go through the code and make sure you understand its different parts.
Q 0.1: Both scripts use the same neural network model. How many trainable parameters does each layer have?
Q 0.3: Why do the plots from the two scripts look different? Why does the second script show a smoother loss? Why are there three jumps in the training curves?
Now that you are familiar with both the static and dynamic graph modes in TensorFlow, you can use either mode for the rest of the homework. Let's try to recognize some natural images. We start by modifying the code to read images from the PASCAL 2007 dataset. The following steps will guide you through the process.
We first need to download the image dataset and annotations. Use the following commands to set up the data. We will refer to the location where it is stored as $DATA_DIR.
# First, cd to a location where you want to store ~0.5GB of data.
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xf VOCtrainval_06-Nov-2007.tar
# Also download the test data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar && tar -xf VOCtest_06-Nov-2007.tar
cd VOCdevkit/VOC2007/
export DATA_DIR=$(pwd)
The first step is to write a data loader which loads this PASCAL data. Since there are only about 10K images in this dataset, we can simply load all the images into CPU memory, along with the labels. The important thing to note is that PASCAL can have multiple objects present in the same image. Hence, this is a multi-label classification problem, and will have to be tackled slightly differently.
We provide some starter code for this task in 02_pascal.py. You need to fill in some of the functions, as outlined next.
Find the function definition for load_pascal in util.py. As the function docstring says, the function takes as input the $DATA_DIR and the split (train/val/trainval/test), and outputs all the images, labels and weights from the dataset. For N images in the split, the images should be an np.ndarray of shape NxHxWx3, and the labels and weights should each be of shape Nx20. The labels should be 1 for each object class that is present in the image, and the weights should be 1 for each label in the image, except those labeled as ambiguous. All other values should be 0. For simplicity, resize all images to a canonical size (e.g., 256x256 px).
In the following tasks, we will use the data in trainval for training and test for testing.
Hint: The dataset contains an ImageSets/Main/ folder, with files named <class_name>_<split_name>.txt. Use those files to find the images that belong to each split of the data. Look at the README to understand the structure and labeling.
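The label and weight construction described above can be sketched as follows. This is only one possible reading of the specification, not the starter code's implementation: it assumes each line of a class file has the form `<image_id> <flag>`, where the flag is 1 (object present), -1 (absent) or 0 (difficult/ambiguous), as documented in the VOC 2007 devkit. The helper names `parse_class_file`, `labels_and_weights` and `load_split_labels` are hypothetical.

```python
import os
import numpy as np

# The 20 PASCAL VOC object classes, in alphabetical order.
CLASS_NAMES = [
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat',
    'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor',
]


def parse_class_file(lines):
    """Parse lines of '<image_id> <flag>' into a dict {image_id: flag}."""
    flags = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            flags[parts[0]] = int(parts[1])
    return flags


def labels_and_weights(flags_per_class, image_ids):
    """Build (N, C) label and weight arrays from one flag dict per class."""
    n, c = len(image_ids), len(flags_per_class)
    labels = np.zeros((n, c), dtype=np.float32)
    weights = np.zeros((n, c), dtype=np.float32)
    for j, flags in enumerate(flags_per_class):
        for i, image_id in enumerate(image_ids):
            flag = flags[image_id]
            labels[i, j] = 1.0 if flag == 1 else 0.0
            # Assumption: flag 0 marks an ambiguous label, which gets weight 0.
            weights[i, j] = 0.0 if flag == 0 else 1.0
    return labels, weights


def load_split_labels(data_dir, split):
    """Hypothetical helper: read all 20 class files for one split."""
    flags_per_class = []
    for name in CLASS_NAMES:
        path = os.path.join(data_dir, 'ImageSets', 'Main',
                            '{}_{}.txt'.format(name, split))
        with open(path) as f:
            flags_per_class.append(parse_class_file(f))
    image_ids = sorted(flags_per_class[0])
    labels, weights = labels_and_weights(flags_per_class, image_ids)
    return image_ids, labels, weights
```

Loading and resizing the images themselves (for example with Pillow) is left out here; the returned `image_ids` would be used to locate the corresponding image files.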
Since we are training a model from scratch on this small dataset, it is important to perform some basic data augmentation to avoid overfitting. Add random crops and left-right flips when training, and do a center crop when testing. For natural images, another common practice is to subtract the mean RGB values of the ImageNet dataset. These mean values are: [123.68, 116.78, 103.94].
Hint: You can use functions such as tf.image.random_flip_left_right, tf.random_crop, etc. These functions can be applied within tf.data.Dataset.map.
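To make the intended operations concrete, here is a minimal NumPy illustration of the train-time and test-time preprocessing. It is not the TensorFlow pipeline itself. In the assignment these steps would more naturally be written with the TensorFlow functions named in the hint. The crop size of 224 is an assumption, as the text does not specify one, and the function names are hypothetical.

```python
import numpy as np

# ImageNet RGB mean values quoted in the text.
IMAGENET_MEAN = np.array([123.68, 116.78, 103.94], dtype=np.float32)
CROP = 224  # hypothetical crop size, not specified in the assignment


def augment_train(image, rng, crop=CROP):
    """Random crop, random left-right flip, then mean subtraction."""
    h, w, _ = image.shape
    top = rng.randint(0, h - crop + 1)    # upper bound is exclusive
    left = rng.randint(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop].astype(np.float32)
    if rng.rand() < 0.5:
        out = out[:, ::-1]                # flip along the width axis
    return out - IMAGENET_MEAN


def preprocess_test(image, crop=CROP):
    """Deterministic center crop, then mean subtraction."""
    h, w, _ = image.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    out = image[top:top + crop, left:left + crop].astype(np.float32)
    return out - IMAGENET_MEAN
```

A call such as `augment_train(image, np.random.RandomState(0))` on a 256x256x3 image returns a 224x224x3 float array. The test-time function involves no randomness, so repeated evaluations of the same image give the same input to the network.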
Write the code for training and testing for multi-label classification in 02_pascal.py. We will be using the same model from Fashion MNIST (bad idea, but let's give it a shot).
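One common choice for a multi-label loss is an independent sigmoid cross-entropy per class, masked by the weights so that ambiguous labels do not contribute. The NumPy sketch below shows that computation under this assumption; it is not necessarily the loss the starter code expects, and the function name is hypothetical. The per-label term uses the same numerically stable formula that TensorFlow documents for tf.nn.sigmoid_cross_entropy_with_logits.

```python
import numpy as np


def weighted_sigmoid_cross_entropy(logits, labels, weights):
    """Mean sigmoid cross-entropy over the entries with non-zero weight.

    Uses the numerically stable form
        max(x, 0) - x * z + log(1 + exp(-|x|))
    for logits x and binary targets z.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    per_label = (np.maximum(logits, 0.0) - logits * labels
                 + np.log1p(np.exp(-np.abs(logits))))
    total_weight = weights.sum()
    if total_weight == 0:
        return 0.0  # nothing to learn from in this batch
    return float((per_label * weights).sum() / total_weight)
```

Unlike a softmax loss, this treats each of the 20 classes as a separate yes/no decision, which is what allows several classes to be active in the same image.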
To evaluate the trained model, we will use a standard metric for multi-label evaluation: mean average precision (mAP). Please implement the code for evaluating the model on a given dataset in the function eval_dataset_map in util.py. You will need to make predictions on the given dataset with the model and call compute_ap to get the average precision.
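For intuition about the metric, the sketch below computes a non-interpolated average precision per class and averages it over classes, skipping entries whose weight is 0. It is an illustration under stated assumptions only: the provided compute_ap may use a different definition (for example, an interpolated variant), and the function names here are hypothetical.

```python
import numpy as np


def average_precision(scores, labels, weights=None):
    """Non-interpolated AP for one class; weight-0 entries are skipped."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    if weights is not None:
        keep = np.asarray(weights) > 0
        scores, labels = scores[keep], labels[keep]
    if labels.sum() == 0:
        return float('nan')  # AP is undefined without positives
    order = np.argsort(-scores, kind='mergesort')  # stable, descending
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values at the ranks of the positive examples.
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(scores, labels, weights):
    """Mean of per-class APs for (N, C) arrays, ignoring undefined classes."""
    aps = [average_precision(scores[:, c], labels[:, c], weights[:, c])
           for c in range(scores.shape[1])]
    return float(np.nanmean(aps)), aps
```

Here `scores` would be the model's per-class outputs on the whole evaluation set, so predictions have to be collected over all batches before the metric is computed.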
TensorFlow ships with an awesome visualization tool called TensorBoard. It can be used to visualize training losses, network weights and other parameters. Add code in 02_pascal.py to visualize the testing mAP and training loss in TensorBoard.
Now that you have implemented all the parts above, we can start training the model on the PASCAL 2007 dataset.
Q 1.1 Show clear screenshots of the learning curves of testing mAP and training loss for 5 epochs (batch size=20, learning rate=0.001). Please evaluate your model to calculate the mAP on the testing dataset every 50 iterations.
As you might have seen, the performance of our simple CNN model was pretty low for PASCAL. This is expected, as PASCAL is much more complex than Fashion MNIST, and we need a much beefier model to handle it. Copy over your code from 02_pascal.py to 03_pascal_caffenet.py, and let's implement a deep CNN.
In this task we will construct a variant of the AlexNet architecture, known as CaffeNet. If you are familiar with Caffe, a prototxt of the network is available here. A visualization of the network is available here.
Here is the exact model we want to build. We use the following operator notation for the architecture:
- Convolution: A convolution with kernel size k, stride s, output channels n, and padding p is represented as conv(k, s, n, p).
- Max Pooling: A max pool operation with kernel size k and stride s is represented as max_pool(k, s).
- Fully connected: For n units, fully_connected(n).
ARCHITECTURE:
-> image
-> conv(11, 4, 96, 'VALID')
-> relu()
-> max_pool(3, 2)
-> conv(5, 1, 256, 'SAME')
-> relu()
-> max_pool(3, 2)
-> conv(3, 1, 384, 'SAME')
-> relu()
-> conv(3, 1, 384, 'SAME')
-> relu()
-> conv(3, 1, 256, 'SAME')
-> relu()
-> max_pool(3, 2)
-> flatten()
-> fully_connected(4096)
-> relu()
-> dropout(0.5)
-> fully_connected(4096)
-> relu()
-> dropout(0.5)
-> fully_connected(20)
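As a sanity check on the architecture above, the following pure-Python sketch traces the spatial size of the feature maps through the convolution and pooling layers, using TensorFlow's 'SAME' and 'VALID' output-size rules. Two things are assumptions: the input crop size (227 is used here purely as an example) and 'VALID' padding for the max pooling layers, which the listing does not specify. The helper names are hypothetical.

```python
import math


def conv_out(size, k, s, padding):
    """Spatial output size of a conv or pool layer (TensorFlow conventions)."""
    if padding == 'SAME':
        return math.ceil(size / s)
    if padding == 'VALID':
        return math.ceil((size - k + 1) / s)
    raise ValueError('unknown padding: {}'.format(padding))


# (kind, kernel, stride, output channels, padding); pooling keeps channels.
LAYERS = [
    ('conv', 11, 4, 96, 'VALID'), ('pool', 3, 2, None, 'VALID'),
    ('conv', 5, 1, 256, 'SAME'), ('pool', 3, 2, None, 'VALID'),
    ('conv', 3, 1, 384, 'SAME'),
    ('conv', 3, 1, 384, 'SAME'),
    ('conv', 3, 1, 256, 'SAME'), ('pool', 3, 2, None, 'VALID'),
]


def trace_shapes(input_size, in_channels=3):
    """Return per-layer (kind, H, W, C) shapes and the flattened size."""
    size, channels, shapes = input_size, in_channels, []
    for kind, k, s, n, padding in LAYERS:
        size = conv_out(size, k, s, padding)
        if kind == 'conv':
            channels = n
        shapes.append((kind, size, size, channels))
    return shapes, size * size * channels
```

Under these assumptions, `trace_shapes(227)` ends with a 6x6x256 feature map, so the first fully connected layer would see 9216 inputs. A different crop size changes that number, which is worth checking before wiring up the dense layers.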
Please modify your code to use the following hyperparameter settings.
- Change the optimizer to SGD + Momentum, with momentum of 0.9.
- Use an exponentially decaying learning rate schedule that starts at 0.001 and decays by a factor of 0.5 every 5K iterations.
- Use batch size 20.
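The learning rate schedule in the list above can be written down as a small function. This sketch assumes a "staircase" interpretation, where the rate is halved in discrete steps every 5K iterations; a smooth decay is the other plausible reading and is available through the `staircase` flag. The momentum update shown is one common formulation of SGD with momentum. Both function names are hypothetical.

```python
def exponential_decay(step, base_lr=0.001, decay_rate=0.5,
                      decay_steps=5000, staircase=True):
    """Learning rate after `step` iterations of exponential decay."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


def momentum_step(param, grad, accum, lr, momentum=0.9):
    """One SGD + momentum update; returns the new parameter and accumulator."""
    accum = momentum * accum + grad
    return param - lr * accum, accum
```

With the staircase reading, the rate is 0.001 for iterations 0 to 4999, 0.0005 for the next 5K, and so on.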
Please add code for saving the model periodically (save at least 30 checkpoints during training for Task 2). Please also save the models for all the remaining scripts (Task 3 and Task 4); for those two tasks, you only need to save the model at the end of training. You will need these models later.
Q 2.1 Show clear screenshots of testing mAP and training loss for 60 epochs. Please evaluate your model to calculate the mAP on the testing dataset every 250 iterations.
Hopefully we all got much better accuracy with the deeper model! Since 2012, many other deeper architectures have been proposed, and VGG-16 is one of the popular ones. In this task, we attempt to further improve the performance with the "very deep" VGG-16 architecture. Copy over your code from 02_pascal.py to 04_pascal_vgg_scratch.py and modify the code.
Modify the network architecture from Task 2 to implement the VGG-16 architecture (refer to the original paper).
Add code to use TensorBoard for visualizing a) Training loss, b) Learning rate, c) Histograms of gradients, d) Training images.
Use the same hyperparameter settings from Task 2, and try to train the model.
Q 3.1 Add screenshots of training and testing loss, testing mAP curves, learning rate, histograms of gradients and examples of training images from TensorBoard.
As we have already seen, deep networks can sometimes be hard to optimize, and at other times they overfit heavily on small training sets. Many approaches have been proposed to counter this, e.g., Krahenbuhl et al. (ICLR'16) and other works we have seen in un-/self-supervised learning. However, the most effective approach remains pre-training the network on large, well-labeled datasets such as ImageNet. While training on the full ImageNet data is beyond the scope of this assignment, people have already trained many popular/standard models and released them online. In this task, we will initialize the VGG model from the previous task with pre-trained ImageNet weights, and finetune the network for PASCAL classification.
Link for VGG-16 pretrained model in Keras:
https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
Copy over your code from 02_pascal.py to 05_pascal_vgg_finetune.py and modify the code.
Load the pre-trained weights up to the fc7 layer, and initialize the fc8 weights and biases from scratch. Then train the network as before. You may use functions such as tf.keras.utils.get_file and the load_weights method of tf.keras.Model. Since the pretrained model might use different names for the weights, you need to figure out how to load the weights correctly.
Q4.1: Use a similar hyperparameter setup as in the scratch case. However, let the learning rate start from 0.0001 and decay by a factor of 0.5 every 1K iterations. Show the learning curves (training and testing loss, testing mAP) for 10 epochs. Please evaluate your model to calculate the mAP on the testing dataset every 60 iterations.
By now we should have a good idea of how to train networks from scratch or from a pre-trained model, and of the relative performance in either scenario. Needless to say, the performance of these models is far stronger than that of the non-deep architectures used until 2012. However, final performance is not the only metric we care about. It is important to get some intuition of what these models are really learning. Let's try some standard techniques.
Extract and compare the conv1 filters from CaffeNet in Task 2, at different stages of the training. Show at least 3 filters.
Pick 10 images from the PASCAL test set from different classes, and compute the 4 nearest neighbors of those images over the test set. You should use and compare the following feature representations for the nearest neighbors:
- pool5 features from the CaffeNet (trained from scratch)
- fc7 features from the CaffeNet (trained from scratch)
- pool5 features from the VGG (finetuned from ImageNet)
- fc7 features from VGG (finetuned from ImageNet)
Show the 10 images you chose and their 4 nearest neighbors for each case.
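Once the features have been extracted into an array, the nearest-neighbor search itself is short. The sketch below assumes Euclidean distance, which the assignment does not prescribe (cosine distance would be another reasonable choice), and flattens each feature so that both pool5 feature maps and fc7 vectors can be passed in. The function name is hypothetical.

```python
import numpy as np


def nearest_neighbors(features, query_indices, k=4):
    """For each query index, return the indices of its k nearest rows.

    features: array with one row (or feature map) per image.
    The query image itself is excluded from its own neighbor list.
    """
    features = np.asarray(features, dtype=np.float64)
    features = features.reshape(len(features), -1)  # flatten pool5 maps
    result = {}
    for q in query_indices:
        dists = np.linalg.norm(features - features[q], axis=1)
        dists[q] = np.inf  # never return the query image itself
        result[q] = np.argsort(dists)[:k].tolist()
    return result
```

Running the same function on the four feature sets listed above, with the same 10 query indices, gives directly comparable neighbor lists.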
We can also visualize how the feature representations specialize for different classes. Take 1000 random images from the test set of PASCAL, and extract CaffeNet (scratch) fc7 features from those images. Compute a 2D t-SNE projection of the features, and plot them with each feature color coded by the ground-truth (GT) class of the corresponding image. If multiple objects are active in that image, compute the color as the "mean" color of the different classes active in that image. Add a legend to the graph showing the color for each object class.
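The "mean" color for images with several active classes can be computed with a single matrix product. The sketch below assumes one RGB color per class, supplied as a palette with values in [0, 1]; the palette itself and the function name are hypothetical.

```python
import numpy as np


def mean_class_colors(labels, palette):
    """Per-image RGB color as the mean of the colors of its active classes.

    labels: (N, C) binary array; palette: (C, 3) RGB values in [0, 1].
    """
    labels = np.asarray(labels, dtype=np.float64)
    palette = np.asarray(palette, dtype=np.float64)
    counts = labels.sum(axis=1, keepdims=True)
    counts = np.maximum(counts, 1.0)  # images with no active class stay black
    return labels.dot(palette) / counts
```

The resulting (N, 3) array can be passed as the point colors of a Matplotlib scatter plot of the 2D projection, which could come, for example, from sklearn.manifold.TSNE with n_components=2.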
Show the per-class performance of your CaffeNet (scratch) and VGG-16 (finetuned) models. Try to explain, by observing examples from the dataset, why some classes are harder or easier than others (consider the easiest and hardest class). Do some classes see large gains due to pre-training? Can you explain why that might happen?
Many techniques have been proposed in the literature to improve classification performance for deep networks. In this section, we try to use a recently proposed technique called mixup. The main idea is to augment the training set with linear combinations of images and labels. Read through the paper and modify your model to implement mixup. Report your performance, along with training/test curves, and comparison with baseline in the report.
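The core of mixup, forming convex combinations of pairs of examples and their labels, can be sketched in NumPy as below. This is a simplified illustration and not a complete solution: it draws a single mixing coefficient per batch from a Beta(alpha, alpha) distribution, pairs each example with a random partner from the same batch, and does not deal with the per-label weights. The value alpha=0.2 and the function name are assumptions chosen for the example.

```python
import numpy as np


def mixup_batch(images, labels, alpha=0.2, rng=None):
    """Mix each example with a randomly chosen partner from the same batch."""
    if rng is None:
        rng = np.random.RandomState()
    images = np.asarray(images, dtype=np.float32)
    labels = np.asarray(labels, dtype=np.float32)
    lam = float(rng.beta(alpha, alpha))   # one mixing coefficient per batch
    perm = rng.permutation(len(images))   # partner index for every example
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels, lam
```

Because the mixed labels are no longer binary, the loss has to accept soft targets; a sigmoid cross-entropy does so without modification.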
Parts of the starter code are taken from official TensorFlow tutorials. Many thanks to the original authors!