Using Pre-Trained Models
Introduction
Neural networks are a different breed of model compared to traditional supervised machine learning
algorithms. Why do I say so? There are multiple reasons, but the most prominent is the cost of the
hardware needed to run them.
In today’s world, RAM on a machine is cheap and available in plenty. If you need hundreds of GBs
of RAM to run a super complex supervised machine learning problem, it can be yours for a small
investment or rental cost. On the other hand, access to GPUs is not that cheap. If you need hundreds
of GBs of VRAM on GPUs, it won’t be straightforward and will involve significant costs.
Now, that may change in the future. But for now, it means that we have to be smarter about the way
we use our resources when solving deep learning problems, especially when we try to solve complex
real-life problems in areas like image and voice recognition. Once you have a few hidden layers in
your model, adding another one demands immense resources.
Thankfully, there is something called “transfer learning” which enables us to use pre-trained
models from other people by making small changes. In this article, I am going to show how we can
use pre-trained models to accelerate our solutions.
Note – This article assumes basic familiarity with Neural networks and deep learning. If you are
new to deep learning, I would strongly recommend that you read the following articles first:
1. What is deep learning and why is it getting so much attention?
2. Deep Learning vs. Machine Learning – the essential differences you need to know!
3. 25 Must Know Terms & concepts for Beginners in Deep Learning
4. Why are GPUs necessary for training Deep Learning models?
Table of Contents
1. What is transfer learning?
2. What is a Pre-trained Model?
3. Why would we use pre-trained models? – A real life example
4. How can I use pre-trained models?
• Extract Features
• Fine tune the model
5. Ways to fine tune your model
6. Use the pre-trained model for identifying digits
• Retraining the output dense layers only
• Freeze the weights of first few layers
What is Transfer Learning?
A neural network is trained on data. The network gains knowledge from this data, which is compiled
as the “weights” of the network. These weights can be extracted and then transferred to any other
neural network. Instead of training the other neural network from scratch, we “transfer” the
learned features.
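To make this concrete, here is a minimal Keras sketch of transferring weights between two networks with the same architecture; the layer sizes are assumptions chosen purely for illustration:

from keras.models import Sequential
from keras.layers import Dense

def build_network():
    # two identical architectures; the sizes here are illustrative only
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=100))
    model.add(Dense(10, activation='softmax'))
    return model

source_net = build_network()  # assume this one has already been trained
target_net = build_network()

# instead of training target_net from scratch, transfer the learned weights
target_net.set_weights(source_net.get_weights())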
Now, let us reflect on the importance of transfer learning by relating it to our own evolution. And
what better way than to use transfer learning for this! So I am picking up a concept touched on by
Tim Urban in one of his recent articles on waitbutwhy.com.
Tim explains that before language was invented, every generation of humans had to re-invent
knowledge for themselves, and this is how knowledge grew from one generation to the next:
Then we invented language! A way to transfer learning from one generation to another, and this is
what happened over the same time frame:
Isn’t it phenomenal and super empowering? Transfer learning by passing on weights is the
equivalent of the language that has disseminated knowledge over generations of human evolution.
What is a Pre-trained Model?
Simply put, a pre-trained model is a model created by someone else to solve a similar problem.
Instead of building a model from scratch, you use a model trained on another problem as your
starting point.
For example, say you want to build a self-driving car. You can spend years building a decent image
recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from
Google, which was built on ImageNet data, to identify objects in images.
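Loading such a pre-trained model takes only a couple of lines in Keras. A minimal sketch (the exact import path may differ across Keras versions):

from keras.applications.inception_v3 import InceptionV3

# downloads the Inception v3 architecture with weights learned on ImageNet
model = InceptionV3(weights='imagenet')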
A pre-trained model may not be 100% accurate in your application, but it saves the huge effort
required to re-invent the wheel. Let me show you this with a recent example.
I used 3 convolutional blocks, with each block following the architecture below:
1. 32 filters of size 5 X 5
2. Activation function – relu
3. Max pooling layer of size 4 X 4
The output of the final convolutional block was flattened to a vector of size [256] and passed
into a single hidden layer of 64 neurons. The output of the hidden layer was passed to the
output layer after applying a dropout rate of 0.5.
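As a reference, here is a minimal Keras sketch of this architecture. The input size and the number of output classes are assumptions for illustration (they depend on the dataset), so the flattened size will vary accordingly:

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
# block 1: 32 filters of 5 X 5, relu activation, 4 X 4 max pooling
model.add(Convolution2D(32, 5, 5, activation='relu', input_shape=(224, 224, 3)))
model.add(MaxPooling2D(pool_size=(4, 4)))
# block 2
model.add(Convolution2D(32, 5, 5, activation='relu'))
model.add(MaxPooling2D(pool_size=(4, 4)))
# block 3
model.add(Convolution2D(32, 5, 5, activation='relu'))
model.add(MaxPooling2D(pool_size=(4, 4)))
# flatten, hidden layer of 64 neurons, dropout of 0.5, then the output layer
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(16, activation='softmax'))  # 16 classes assumed, as in the problem discussed below
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])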
The result obtained with the above architecture is summarized below:
Epoch 10/10
50/50 [==============================] - 21s - loss: 13.5733 - acc: 0.1575
Though my accuracy increased in comparison to the MLP output, the time taken to run a single
epoch also increased, to 21 seconds.
The major point to note, however, was that the majority class made up around 17.6% of the dataset.
So even if we had predicted the class of every image in the train dataset to be the majority class,
we would have performed better than both the MLP and the CNN. Adding more convolutional blocks
substantially increased my training time. This led me to switch to pre-trained models, where
I would not have to train my entire architecture but only a few layers.
So, I used the VGG16 model, which is pre-trained on the ImageNet dataset and provided in the
Keras library. Below is the architecture of the VGG16 model I used.
The only change I made to the existing VGG16 architecture was replacing the final softmax layer
with 1,000 outputs by one with the 16 categories suitable for our problem, and re-training the
dense layer.
This architecture gave me an accuracy of 70%, much better than the MLP and the CNN. Also, the
biggest benefit of using the VGG16 pre-trained model was the almost negligible time needed to
train the dense layer while achieving greater accuracy.
So I moved forward with this approach of using a pre-trained model, and the next step was to
fine-tune my VGG16 model to suit this problem.
How can I use Pre-trained Models?
What is our objective when we train a neural network? We wish to identify the correct weights for
the network through multiple forward and backward iterations. By using pre-trained models which
have been previously trained on large datasets, we can directly use the weights and architecture
obtained and apply the learning to our problem statement. This is known as transfer learning: we
“transfer the learning” of the pre-trained model to our specific problem statement.
You should be very careful while choosing which pre-trained model to use in your case. If the
problem statement at hand is very different from the one on which the pre-trained model was
trained, the predictions we get will be very inaccurate. For example, a model previously trained
for speech recognition would work horribly if we tried to use it to identify objects.
We are lucky that many pre-trained architectures are directly available for us in the Keras library.
The ImageNet dataset has been widely used to build various architectures, since it is large enough
(1.2 million images) to create a generalized model. Its problem statement is to train a model that
can correctly classify an input image into 1,000 separate object categories. These 1,000 categories
represent object classes that we come across in our day-to-day lives, such as species of dogs and
cats, various household objects, vehicle types, and so on.
These pre-trained networks demonstrate a strong ability to generalize to images outside the
ImageNet dataset via transfer learning. We make modifications to the pre-existing model by
fine-tuning it. Since we assume that the pre-trained network has been trained quite well, we do
not want to modify its weights too soon or too much, so while fine-tuning we generally use a
learning rate smaller than the one used for initially training the model.
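In Keras this amounts to compiling the model with a reduced learning rate. A minimal sketch, where the exact learning rate value is an assumption to be tuned in practice:

from keras.optimizers import SGD

# a learning rate much smaller than the one used for initial training,
# so the pre-trained weights are nudged rather than overwritten
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=1e-4, momentum=0.9), metrics=['accuracy'])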
1. Retrain the output dense layers only – Here we use VGG16 purely as a feature extractor and
train a new MLP classifier on the extracted features. First, the images are read in and resized
to the 224 X 224 input size that VGG16 expects:

# converting train images to array form
train_img=[]
for i in range(len(train)):
    temp_img=image.load_img(train_path+train['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    train_img.append(temp_img)

# converting train images to an array and applying mean subtraction preprocessing
train_img=np.array(train_img)
train_img=preprocess_input(train_img)
# applying the same procedure to the test dataset
test_img=[]
for i in range(len(test)):
    temp_img=image.load_img(test_path+test['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    test_img.append(temp_img)

test_img=np.array(test_img)
test_img=preprocess_input(test_img)
# loading the VGG16 model weights, without the top classification layers
model = VGG16(weights='imagenet', include_top=False)

# extracting features from the train dataset using the VGG16 pre-trained model
features_train=model.predict(train_img)

# extracting features from the test dataset using the VGG16 pre-trained model
features_test=model.predict(test_img)
# flattening the extracted features to conform to MLP input:
# 49000 train images, each yielding a 7 x 7 x 512 = 25088-dimensional feature vector
train_x=features_train.reshape(49000,25088)
# converting target variable to array
train_y=np.asarray(train['label'])
# performing one-hot encoding for the target variable
train_y=pd.get_dummies(train_y)
train_y=np.array(train_y)
# creating training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_valid, Y_train, Y_valid=train_test_split(train_x,train_y,test_size=0.3, random_state=42)
# creating an MLP model on top of the extracted features
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model=Sequential()
model.add(Dense(1000, input_dim=25088, activation='relu', kernel_initializer='uniform'))
model.add(Dropout(0.3))
model.add(Dense(500, activation='sigmoid'))
model.add(Dropout(0.4))
model.add(Dense(150, activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(units=10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
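The dense layers can then be trained on the extracted features. A minimal sketch, where the number of epochs and the batch size are assumptions:

# training the new classifier on the VGG16 features
model.fit(X_train, Y_train, epochs=10, batch_size=128, validation_data=(X_valid, Y_valid))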
2. Freeze the weights of the first few layers – Here we freeze the weights of the first 8 layers
of the VGG16 network and retrain the subsequent layers. This is because the first few layers
capture universal features like curves and edges that are also relevant to our new problem. We
want to keep those weights intact and let the network focus on learning dataset-specific features
in the subsequent layers.
Code for freezing the weights of the first few layers:
# importing the required libraries
from keras.models import Sequential
from keras.layers import Input, Dense, Convolution2D, MaxPooling2D, AveragePooling2D, ZeroPadding2D, Dropout, Flatten, merge, Reshape, Activation
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
from keras.optimizers import SGD
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss
from scipy.misc import imread
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import keras
get_ipython().magic('matplotlib inline')
# reading the csv files that list the image filenames and labels
train=pd.read_csv("R/Data/Train/train.csv")
test=pd.read_csv("R/Data/test.csv")
train_path="R/Data/Train/Images/train/"
test_path="R/Data/Train/Images/test/"
# converting train images to array form, resized to the VGG16 input size
train_img=[]
for i in range(len(train)):
    temp_img=image.load_img(train_path+train['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    train_img.append(temp_img)

train_img=np.array(train_img)
train_img=preprocess_input(train_img)
# applying the same procedure to the test dataset
test_img=[]
for i in range(len(test)):
    temp_img=image.load_img(test_path+test['filename'][i],target_size=(224,224))
    temp_img=image.img_to_array(temp_img)
    test_img.append(temp_img)

test_img=np.array(test_img)
test_img=preprocess_input(test_img)
from keras.models import Model

def vgg16_model(img_rows, img_cols, channel=1, num_classes=None):
    # loading VGG16 with its fully connected layers included
    model = VGG16(weights='imagenet', include_top=True)

    # removing the final 1000-way softmax layer
    model.layers.pop()
    model.outputs = [model.layers[-1].output]
    model.layers[-1].outbound_nodes = []

    # adding a new softmax layer with num_classes outputs
    x=Dense(num_classes, activation='softmax')(model.output)
    model=Model(model.input,x)

    # to set the first 8 layers to non-trainable (weights will not be updated)
    for layer in model.layers[:8]:
        layer.trainable = False

    return model
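With the first eight layers frozen, the model can be compiled with a small learning rate and trained end-to-end on the images. A minimal sketch, assuming X_train and Y_train hold the preprocessed images and one-hot labels, with the learning rate, epochs, and batch size as illustrative assumptions:

# building the model for the 10-class digits problem and fine-tuning it
model = vgg16_model(224, 224, 3, num_classes=10)
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=1e-4, momentum=0.9), metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=10, batch_size=16, validation_data=(X_valid, Y_valid))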